Mastering Statistics: Fundamentals of Data Analysis.

Contents

  • 1 Introduction to Inference
  • 2 Expected Value and Standard Error:
    • 2.0.1 Expected value and standard error for the sum
    • 2.0.2 Expected value and standard error for percentages.
    • 2.1 The sampling distribution
      • 2.1.1 Probability histogram
      • 2.1.2 Empirical histogram
      • 2.1.3 Probability histogram of the sample
  • 3 Central limit theorem (CLT)
  • 4 Significance test vs Confidence intervals
  • 5 Confidence intervals
    • 5.1 Manual calculation:
    • 5.2 Getting the confidence intervals from a test result
      • 5.2.1 Interpreting confidence intervals: practical example
  • 6 Significance testing
    • 6.0.1 One sided vs two sided test.
    • 6.0.2 Common errors in significance testing
    • 6.0.3 Statistical power
    • 6.0.4 Student’s t test
  • 7 Categorical data
    • 7.1 Confidence Intervals for Proportions.
      • 7.1.1 Non-binary categorical variables
      • 7.1.2 Few observations for each outcome
    • 7.2 Significance testing for proportions
    • 7.3 Goodness of Fit
  • 8 Two-Sample testing
    • 8.1 Significance testing for Two-Samples data
      • 8.1.1 z-test
      • 8.1.2 The Welch Two Sample t-test
    • 8.2 Confidence intervals for Two-Samples data.
      • 8.2.1 Pooled estimate:
      • 8.2.2 Pooled standard deviation:
    • 8.3 Paired-t test
    • 8.4 The sign test
    • 8.5 Wilcoxon Rank Sum Test
    • 8.6 Two-samples of binary data
    • 8.7 Two-samples of categorical variables
      • 8.7.1 Testing homogeneity
      • 8.7.2 Independence Testing of Categorical Variables
      • 8.7.3 Comparing the different uses of the chi-square test:
  • 9 Permutation Tests
  • 10 Multiple test corrections
  • 11 Choosing the right test for your data
    • 11.0.1 Statistical assumptions
    • 11.0.2 Types of variables
    • 11.1 Choosing a parametric test: regression, comparison, or correlation
      • 11.1.1 Regression tests
      • 11.1.2 Comparison tests
    • 11.2 Flowchart for parametric tests
    • 11.3 Choosing a non parametric test

Inference

1 Introduction to Inference

Drawing conclusions from data is difficult because we almost never have complete information about any variable of interest. For instance, we may have only been able to send a satisfaction survey to a small subset of our customers.

  • The population of a study consists of all possible observations of interest.

  • A sample in a study consists of all the observations we actually have.

We are almost always interested in a population, but only have information about a sample. Statistical inference is the process of reasoning from the one to the other.

  • A parameter is a number that describes a population

  • A statistic is a number that describes a sample

The mean of the eye difference variable in our optical dataset is -0.4939496. The sample mean is found by adding together all the values of a variable in the sample data, and dividing this total by \(n\), the sample size.

If we extract 100 samples with three observations each from our data and visually show their means, we can see that we get a different value every time; sometimes they will be very close to the target parameter, but sometimes they can be quite far.

| Appt Date | Patient Name | DOB | Age | IsPrivate | IsSubsidised | IsSmoker | IsDriver | TakingMedication | Left Eye Value | Right Eye Value | Optician First Name | Optician Last Name | eye_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-08-20T22:16:53Z | Carl Walker | 1960-04-19T00:00:00Z | 60 | 1 | 0 | 0 | 0 | yes | 3.843656 | 4.991663 | Elizabeth | Montgomery | -1.148007 |
| 2019-07-24T13:44:16Z | Aaron Wheeler | 1958-02-15T00:00:00Z | 62 | 1 | 1 | 1 | 1 | yes | 4.359093 | 1.861221 | Shirley | Evans | 2.497872 |
| 2019-01-22T23:39:12Z | Martin Franklin | 1993-12-17T00:00:00Z | 26 | 0 | 0 | 1 | 1 | no | 4.522499 | 1.495162 | Kimberly | Cook | 3.027337 |
| 2019-11-06T05:03:04Z | Thomas Parker | 2000-03-11T00:00:00Z | 20 | 1 | 1 | 0 | 0 | no | 1.653130 | 5.302994 | Kenneth | Griffin | -3.649863 |
| 2019-07-20T15:25:28Z | Joe Ward | 1970-03-06T00:00:00Z | 50 | 1 | 1 | 0 | 1 | yes | 1.148688 | 5.438031 | Kimberly | Cook | -4.289344 |
| 2019-06-26T12:36:19Z | Kenneth Bradley | 1941-02-26T00:00:00Z | 79 | 1 | 0 | 1 | 1 | no | 2.749662 | 1.294820 | Sara | Alexander | 1.454841 |
Mean  |   SD | Median |  MAD |   Min |  Max | n_Obs | Skewness | Kurtosis | percentage_Missing
----------------------------------------------------------------------------------------------
-0.49 | 1.84 |  -0.51 | 1.98 | -4.96 | 3.90 | 10000 |     0.02 |    -0.64 |               0.00

Variability is the natural tendency for a statistic to be different across different samples.

A biased statistic, on the other hand, misses the mark on average; for example, if we increase the value of all our sample means by 1:

Reproducibility and Replicability

Replicating means getting the same conclusion with slightly different samples, procedures, and data analysis methods.

Related to that is the problem of reproducibility: getting the same results when you simply use the same data and the same methods that were reported in the analysis.

  • Shortcomings of statistical analysis

Statistical inference can handle variability, but not bias. Bias comes from incorrect data collection and only good data collection and management practices can prevent bias.

Statistical inference can lead to wildly wrong conclusions when misused or abused for example:

  • Sample bias: not choosing the sample randomly.

  • Undercoverage: your sampling procedure misses out on portions of the population.

  • Overgeneralization: you need to limit the conclusions that you draw to the population that you have sampled from.

Significance tests are particularly easy to abuse:

  • If you run a test just because you’ve spotted a pattern in your data, you’re almost certain to get a positive result even if the trend is just due to random chance. Patterns exist just by chance.

  • p-hacking: repeated testing of the same hypothesis will eventually yield a positive result just by random chance.

  • Confusing significance for importance. Just because a difference is significant does not mean it is important.

No amount of statistical savvy can fully remediate bad data. All of the techniques we’ll discuss in this course are designed to address variability in data, not bias. Before attempting any sort of analysis, you should always take some time to think critically about the overall quality of the data. A few questions to consider:

  • Is the sample representative of the population of interest, or might some subgroups be under or over represented?

  • How was the data collected? Are any of the variables self-reported? Were there points in the process where those collecting the data made subjective judgments?

  • How will conclusions from the data analysis be used? How might they be misinterpreted or abused?

2 Expected Value and Standard Error:

If we are interested in the average height of adult males in the USA and we extract an adult male at random, we expect his height to be around the population average, give or take about one standard deviation \(\sigma\). The expected value is the population average \(\mu\). The expected value of the sample average is still the population average, but remember that the sample average is a random variable because sampling is a random process, so it won’t be exactly the population mean. How far off can we expect this sample mean to be from the population mean?

The standard error (SE) of a statistic tells you roughly how far off the statistic will be from its expected value. For the sample mean, it indicates how much the sample mean is expected to vary from the true population mean.

\[ \text{SE} (\bar{x_n})= \frac{\sigma}{\sqrt{n}} \tag{1}\]

where n is the sample size. We can actually use this formula to decide what is the sample size we need to use for a desired accuracy in our statistic.

It plays a role similar to that of the standard deviation for a single observation: remember that the standard deviation measures the dispersion of individual data points around the mean of a dataset, and tells you how spread out the values are. In summary, the standard error of an estimate is the standard deviation of the sampling distribution of that estimate.
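A quick simulation makes Equation 1 concrete. This is a sketch with an arbitrary standard normal population (sigma and n are chosen for illustration, not taken from any dataset in this course): the spread of many simulated sample means should match \(\sigma/\sqrt{n}\).

```r
# A sketch: simulate many sample means from a population with known sigma
# and check that their spread matches sigma / sqrt(n)
set.seed(42)
sigma <- 2    # population standard deviation (chosen for this illustration)
n     <- 25   # sample size

# 10,000 sample means, each computed from a fresh sample of n draws
means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

sd(means)        # empirical standard error of the sample mean
sigma / sqrt(n)  # theoretical SE: 2 / 5 = 0.4
```

The two printed values agree closely, and the agreement improves as the number of simulated samples grows.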

Exercise:

A town has 10,000 registered voters, of whom 6,000 are voting for the Democratic party. A survey organization is taking a sample of 100 registered voters (assume sampling with replacement). The percentage of Democratic voters in the sample will be around _____, give or take ____. (You may use the fact that the standard deviation is about 0.5)

The percentage of Democratic voters in the sample is equal to the mean of the sample, and so the “around” value is the expected value of the sample mean, which in turn is equal to the population mean, 0.60 or 60%. Similarly, the “give or take” value is the standard error (SE) of the mean, which is equal to \(\frac{\sigma}{\sqrt{n}}\) , where σ is the standard deviation of the population and n is the size of the sample. We’re told that σ is about 0.5 and n is 100, so the standard error of the mean is 0.05 or 5%.
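The same arithmetic can be written out in R (a sketch of the calculation above, using the values given in the exercise):

```r
# Survey of n = 100 voters from a population where 60% vote Democratic
p     <- 0.60   # population proportion (expected value of the sample proportion)
sigma <- 0.5    # population standard deviation, as given in the exercise
n     <- 100    # sample size

expected_pct <- p * 100                # "around" value: 60%
se_pct       <- sigma / sqrt(n) * 100  # "give or take" value: 5%
```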

2.0.1 Expected value and standard error for the sum

If we are interested in the sum of n draws \(S_n\), the sum and the average are related by \(S_n = n\bar{x_n}\) so the standard error of the sum: \[ SE(S_n)=\sqrt{n}\sigma \tag{2}\]

which tells us that the variability of the sum of n draws increases at the rate of \(\sqrt{n}\) so while increasing the sample size reduces the standard error of the average, it actually increases the standard error for the sum.

Exercise1 (Expected Value):

You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be $10, $50, or $100. You may use the fact that the standard deviation of the three amounts $10, $50 and $100 is $37.

What is the expected value of the sum of the 100 pledges?

The expected value of the sum of a sample is nμ, where n is the size of the sample and μ is the mean of the population. Here we’re told that n is 100 and μ must be the average of $10, $50, and $100, which is $53.33, so the expected value of the sample sum is $5333.

Exercise2 (Standard error for the sum)

You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be $10, $50, or $100. You may use the fact that the standard deviation of the three amounts $10, $50 and $100 is $37.

What are the chances that the 100 pledges total more than $5,700 ?

The standard error of the sample sum is \(SE(S_n)=\sqrt{n}\sigma\) where n is the sample size and σ is the standard deviation of the population.

Since n is 100 and we’re told that the standard deviation of the population is $37, we know the standard error of the sample sum is $370.

Using the normal approximation to the sampling distribution, this tells us that $5700 is roughly one standard error above the mean of the sampling distribution (5700 − 5333 ≈ 367, against an SE of $370). From the empirical rule, we know that one half of 68% of the data lies within one standard deviation to the right of the mean, so 50 + (1/2)·68 = 84% of the data lies to the left of $5700. That is, about 16% of the data lies to the right of $5700.
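Both exercises can be checked in R. This sketch replaces the empirical rule with pnorm, which gives essentially the same answer:

```r
# Normal approximation for the total of 100 pledges of $10 / $50 / $100
amounts <- c(10, 50, 100)
n       <- 100
mu      <- mean(amounts)                 # population mean, about $53.33
sigma   <- sqrt(mean((amounts - mu)^2))  # population SD, about $37

e_sum  <- n * mu            # expected value of the sum, about $5333
se_sum <- sqrt(n) * sigma   # standard error of the sum, about $370

# chance that the 100 pledges total more than $5,700: about 16%
1 - pnorm(5700, mean = e_sum, sd = se_sum)
```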

2.0.2 Expected value and standard error for percentages.

For percentages, the expected value is \[E = \mu * 100\% \tag{3}\] and the Standard Error \[SE=\frac{\sigma}{\sqrt{n}} *100\% \tag{4}\]

All the above formulas are for sampling with replacement, but they are still approximately true when sampling without replacement if the sample size is much smaller than the size of the population.


Exercise:

We toss a coin 100 times. How many tails do you expect to see?

We assign tails a value of 1 and heads a value of 0.

p(1)=0.5, p(0)=0.5

Number of expected tails = \(E(sum) = 100 * \mu\)

and \(\mu = 0*p(0)+1*p(1)\) , so \(\mu =0.5\) and \(E= 100*0.5\) =50 tails.

The Standard Error will be \(SE(sum) = \sqrt{100}*\sigma\)

to calculate sigma, we know that \(\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}\) where:

  • (\(\sigma\)) is the standard deviation,

  • (\(N\)) is the total number of observations,

  • (\(X_i\)) represents each individual observation,

  • (\(\mu\)) is the mean of the observations.

In our exercise, without having to toss the coin 100 times, we calculate it like this:

\(\sigma =\sqrt{(x_1-\mu)^{2} \, p(x_1) +(x_2-\mu)^{2}\, p(x_2)}\)

\(\sigma =\sqrt{(0-0.5)^{2} \times 0.5 +(1-0.5)^{2}\times 0.5}=0.5\)

\(\sigma^2 =(0-0.5)^{2} \times 0.5 +(1-0.5)^{2}\times 0.5=0.25\)

\(SE(sum) = \sqrt{100}\,\sigma=10 \times 0.5=5\)

so our answer is that we expect to see 50 tails, give or take 5 tails.

And in percentages we would expect to see \(E = 0.5 \times 100 = 50\%\) tails, with a standard error of \(SE = 0.5/10 \times 100 = 5\%\).

In R code, here is the same problem with n = 200 tosses. Note how the expected number of tails scales with n while the expected percentage stays the same.

Code
# Define the probabilities 
p <- c(0.5, 0.5)  
# Define the values (0 for heads, 1 for tails) 
x <- c(0, 1)  
# Calculate the mean (μ) 
mu <- sum(x * p)  
# Calculate the variance (σ^2) 
variance <- sum((x - mu)^2 * p)  
# Calculate the standard deviation (σ) 
sigma <- sqrt(variance)  
# Number of coin tosses 
n <- 200  
# Calculate the standard error (SE) 
SE <- sqrt(n) * sigma  
 
cat("Expected number of tails:", n * mu, "\n") 
Expected number of tails: 100 
Code
cat("Standard error (number of tails):", SE, "\n") 
Standard error (number of tails): 7.071068 
Code
cat("Expected percentage of tails:", mu * 100, "% \n") 
Expected percentage of tails: 50 % 
Code
cat("Standard error percentage:", sigma / sqrt(n) * 100, "% \n")
Standard error percentage: 3.535534 % 

2.1 The sampling distribution

If we toss a fair coin 100 times, we can get any number of tails from 0 to 100. How likely is each outcome?

The number of tails has the binomial distribution with n=100 and p=0.5 where success = tails. So if the statistic of interest is \(S_n=number\ of \ tails\) , then \(S_n\) is a random variable whose probability histogram is given by the binomial distribution. This is called the sampling distribution of the statistic \(S_n\). The sampling distribution provides more detailed information about the chance properties of \(S_n\) than the summary numbers given by the expected value and the standard error alone.
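A sketch of this sampling distribution using dbinom, checking that its mean and spread match the expected value (50 tails) and standard error (5 tails) computed earlier:

```r
# Sampling distribution of S_100 = number of tails in 100 fair-coin tosses
k     <- 0:100
probs <- dbinom(k, size = 100, prob = 0.5)   # P(S_100 = k) for each k

ev <- sum(k * probs)                  # expected value: 50
se <- sqrt(sum((k - ev)^2 * probs))   # standard error: 5

barplot(probs, names.arg = k)         # the probability histogram of S_100
```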

There are three histograms to take into consideration:

2.1.1 Probability histogram

The probability histogram for producing the data. Since both tails and heads have the same probability, our histogram would be two columns with the same height.


2.1.2 Empirical histogram

If we actually toss the coins, we will have an empirical histogram.

2.1.3 Probability histogram of the sample

And finally we can plot the probability histogram of the statistic \(S_{100}\), which shows the sampling distribution:

3 Central limit theorem (CLT)

The Central Limit Theorem is one of the most important results in statistics. It says that with sufficiently large sample sizes, the sample average approximately follows a normal distribution. This underscores the importance of the normal distribution, and of the many commonly used methods that assume the data are normally distributed.

Let’s stop and think about what it means for the sample average to have a distribution.

Imagine going to the store and buying a bag of your favorite brand of chocolate chip cookies. Suppose the bag has 24 cookies in it. Will each cookie have the exact same number of chocolate chips in it? It turns out that if you make a batch of cookies by adding chips to dough and mixing it really well, then putting the same amount of dough onto a baking sheet, the number of chips per cookie closely follows a Poisson distribution. (In the limiting case of chips having zero volume, this is exactly a Poisson process.) Thus we expect there to be a lot of variability in the number of chips per cookie. We can model the number of chips per cookie with a Poisson distribution.

We can also compute the average number of chips per cookie in the bag. For the bag we have, that will be a particular number. But there may be more bags of cookies in the store. Will each of those bags have the same average number of chips? If all of the cookies in the store are from the same industrial-sized batch, each cookie will individually have a Poisson number of chips. So the average number of chips in one bag may be different from the average number of chips in another bag. Thus we could hypothetically find out the average number of chips for each bag in the store. And we could think about what the distribution of these averages is, across the bags in the store, or all the bags of cookies in the world.

It is this distribution of averages that the central limit theorem says is approximately a normal distribution, with the same mean as the distribution for the individual cookies, but with a standard deviation that is divided by the square root of the number of samples in each average (i.e., the number of cookies per bag).

When sampling with replacement and \(n\) is large, then the sampling distribution of the sample average approximately follows the normal curve.

For example, going back to the example of the online gambling game where we had a probability of winning a small prize of \(p=0.2\), we can see how the probability histogram for the binomial distribution approaches the normal distribution more and more as we increase the number of trials \(n\).

That means that we can use normal approximation to compute probabilities. The key point of the theorem is that we know that the sampling distribution of the statistic is normal no matter what the population histogram is.
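For instance (a sketch, reusing the coin-toss statistic from above), the chance of seeing 60 or more tails in 100 tosses can be computed exactly from the binomial distribution or via the normal approximation with mean 50 and SE 5:

```r
# Exact binomial probability of 60 or more tails in 100 fair tosses
exact <- sum(dbinom(60:100, size = 100, prob = 0.5))

# Normal approximation with mean 50, SE 5 (continuity correction at 59.5)
approx <- 1 - pnorm(59.5, mean = 50, sd = 5)

c(exact = exact, approx = approx)  # both close to 0.028
```

The two values agree to within a fraction of a percentage point, which is the CLT at work.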

The mathematical theory behind that:

The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \(\bar{y}\) of a random sample follows a normal distribution centered at the population average μ and with standard deviation equal to the population standard deviation σ, divided by the square root of the sample size N. We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error. We have seen this formula before when studying the Standard Error (Equation 1): \[ \text{SE} (\bar{x_n})= \frac{\sigma}{\sqrt{n}} \]

This is telling us that when we take more samples and plot their results in a histogram, the spread of the histogram becomes smaller (closer to the population average) as we saw in the histograms above.

Standardized sample mean

This implies that if we take many samples of size \(N\), then the following quantity is approximated by a normal distribution centered at 0 and with standard deviation 1:

\[ \frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}} \tag{5}\]

  • \(\bar{Y}\): The sample mean, which is the average of your sample data.

  • \(\mu\): The population mean, which is the average of the entire population from which the sample is drawn.

  • \(\sigma_Y\): The population standard deviation, which measures the spread of the population data.

  • \(N\): The sample size, or the number of observations in your sample.

The standardized sample mean is a way to transform the sample mean into a standard normal distribution (mean 0, standard deviation 1). Standardizing the sample mean by subtracting the population mean and dividing by the standard error \(\sigma_Y/\sqrt{N}\) allows us to compare it to the standard normal distribution. This is useful because the properties of the normal distribution are well understood and widely applicable in statistical inference.
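A sketch of this standardization by simulation, using a deliberately skewed population (the exponential distribution with rate 1, which has mean 1 and standard deviation 1; these values are assumptions of the illustration, not from our datasets):

```r
# Standardized sample means from a skewed population: Exp(1), mu = 1, sigma = 1
set.seed(1)
N <- 50   # sample size

# (sample mean - mu) / (sigma / sqrt(N)), repeated 10,000 times
z <- replicate(10000, (mean(rexp(N, rate = 1)) - 1) / (1 / sqrt(N)))

mean(z)               # close to 0
sd(z)                 # close to 1
qqnorm(z); qqline(z)  # roughly straight line: approximately standard normal
```

Even though the population is far from normal, the standardized sample means are approximately standard normal, which is exactly what Equation 5 claims.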

Central limit theorem in practice. The central limit theorem allows us, when we don’t have access to the population data, to use the normal approximation to compute \(p\)-values.

We are going to use the mice bodyweight data for our example. We will work only with female mice because the bodyweights of female and male mice are different. We calculate the mean and the standard deviation of the control and treatment groups, and those will be the parameters of our population.

The CLT tells us that when the samples are large, the random variable is normally distributed. Thus we can compute \(p\)-values using the function pnorm

Code
library(dplyr)
dat <- read.csv("data/mice_pheno.csv") 
table(dat$Sex, dat$Diet)
   
    chow  hf
  F  225 200
  M  224 197
Code
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  dplyr::select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  dplyr::select(Bodyweight) %>% unlist

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
[1] 2.375517
Code
sd_hf <- sd(hfPopulation)
sd_hf
[1] 5.082592
Code
sd_control <- sd(controlPopulation)
sd_control
[1] 3.424056

Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we have to estimate them from samples.

As we described, the CLT tells us that for large N, each of these is approximately normal with mean equal to the population mean and standard error equal to the population standard deviation divided by \(\sqrt{N}\). We mentioned that a rule of thumb is that N should be 30 or more. However, that is just a rule of thumb, since the preciseness of the approximation depends on the population distribution. Here we can actually check the approximation, and we do that for various values of N.

We are going to create 10,000 samples of N mice for each of N = 3, 12, 25 and 50, and calculate the difference in mean bodyweight between the treatment and control groups. We then calculate the t-statistic for each sample size (the t-statistic is explained later).

Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1. This ratio is what we call the t-statistic.

\[ t = \frac{\bar{x} - \mu}{SE} \tag{6}\] where \(SE = \frac{s}{\sqrt{n}}\)

In the specific case of our experiment (comparing the means of two samples) this translates into:

\[ t = \frac{\bar{y} - \bar{x}}{SE}=\frac{\bar{y} - \bar{x}}{\sqrt{\frac{s_y^2}{n} + \frac{s_x^2}{n}}} \]

Finally we will plot a QQ plot for each sample size against the normal distribution to see their fit:

Code
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations

##function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation,n)
  x <- sample(controlPopulation,n)
  (mean(y)-mean(x))/sqrt(var(y)/n+var(x)/n)
}
res <-  sapply(Ns,function(n) {
  replicate(B,computetstat(n))
})
par(mfrow=c(2,2))
for (i in seq(along=Ns)) {
  qqnorm(res[,i],main=Ns[i])
  qqline(res[,i],col=2)
}

So we see that for \(N=3\), the CLT does not provide a usable approximation. For \(N=12\), there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50, the approximation is good.

This simulation only proves that \(N=12\) is large enough in this case, not in general. The further the population distribution is from the normal distribution, the larger the sample size we will require.

For populations that are already normally distributed, even small sample sizes will result in sample means that are approximately normally distributed. For populations that are not normally distributed, larger sample sizes are required for the sample means to approximate a normal distribution.

We will see later that we don’t usually calculate our \(p\)-values like this using the CLT, instead we will use a t-test that will give us a slightly different result. This is to be expected because our CLT approximation considered the denominator of t-stat practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.

When does the central limit theorem apply? For the normal approximation to work, the key requirements are:

  • We sample with replacement, or we simulate independent random variables from the same distribution. (If the sample size is much smaller than the population size, then sampling with replacement and sampling without replacement are approximately the same, so this also applies.)

  • The statistic of interest is a sum, average or percentage.

  • The sample size is large enough. The more skewed the population histogram is, the larger the required sample size n. If there is no strong skewness, then n=15 is sufficient.

Exercise:

Suppose we are interested in the proportion of times we see a 6 when rolling n=100 dice. This is a random variable which we can simulate with x=sample(1:6, n, replace=TRUE) and the proportion we are interested in can be expressed as an average: mean(x==6). Because the die rolls are independent, the CLT applies.

We want to roll n dice 10,000 times and keep these proportions. This random variable (proportion of 6s) has mean \(p=1/6\) and variance \(\frac{p(1-p)}{n}\). So according to the CLT: \[ z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \]

\(z\) should be normal with mean 0 and SD 1.

Perform the simulation, and report what proportion of times \(z\) was larger than 2 in absolute value (CLT says it should be about 0.05).

Code
n=100
N=10000
p=1/6
SE<- sqrt(p*(1-p)/n)
set.seed(1)

res<- replicate(N,sample(1:6,n,replace=TRUE))

prop = apply(res, 2, function(x) mean(x == 6))

par(mfrow = c(1,2))
hist(prop, freq=TRUE)

z_scores <- (prop-p)/SE
plot(z_scores,dnorm(z_scores))

Code
proportion_greater_than_2 <- mean(abs(z_scores) > 2)
print(proportion_greater_than_2)
[1] 0.0431

4 Significance test vs Confidence intervals

There are two fundamental sorts of questions we might ask our sample data:

  • How can we estimate a parameter while taking into account the variability of the data in an honest way?

  • How can we know if our data supports a specific conclusion, given its inherent variability?

The first of these questions leads to the idea of a confidence interval, where we specify not only an estimate of a parameter but also a reasonable margin of error. The latter question gives rise to significance testing.

  • Use a confidence interval if you wish to estimate a parameter from a sample in a way that describes not only the observed mean but also the uncertainty surrounding it.

  • Use a significance test if there is one particular value of interest, for instance representing whether a piece of machinery is properly calibrated.

Significance tests and confidence intervals are two sides of the same coin.

Significance tests consider specific individual values, while confidence intervals give more general information about the parameter of interest. The latter are more flexible and are generally easier to interpret. Consider a significance test only when there is a single parameter value of interest which you could identify before looking at the data.

5 Confidence intervals

A confidence interval is a way of reporting an estimate of a parameter that includes information about how much variability could reasonably be expected due to random chance. A confidence interval for the mean of a quantitative variable has the form

\[ \mu = \bar{x} \pm ME \]

where \(\mu\) is the population mean, \(\bar{x}\) is the observed sample mean and \(ME\) is a margin of error.

5.1 Manual calculation:

A confidence interval gives a range of plausible values for a population parameter. Usually the confidence interval is centered at an estimate for \(\mu\) which is an average. Since the central limit theorem applies for averages, the confidence interval has a simple form:

\[ CI = estimate \pm ME \]

where ME is the margin of error and can be decomposed like this:

\[ CI=estimate \pm z \times SE = \bar{x} \pm z \times SE \tag{7}\]

where SE is the standard error of the sample. If that is an average or a percentage then it is \(SE= \frac{\sigma}{\sqrt{n}}\) as we saw in (Equation 1).

And \(z\) is the \(z\)-score corresponding to the desired confidence level. For example, if we would like to have a 95% confidence level, our \(z=1.96\)

This comes from calculating the value of \(z\) where 95% of our data is in the middle:

To get to that \(z\) value we use tables or we can use the quantile function qnorm:

Code
# Calculate the z-score for a 95% confidence interval
z_score <- qnorm(0.975)
z_score
[1] 1.959964

The 0.975 value is used because for a 95% confidence interval, you need to capture the central 95% of the distribution, leaving 2.5% in each tail. Therefore, you look up the 97.5th percentile (0.975) to get the \(z\)-score.

For 90% confidence interval we are looking for the \(z\) value in the normalized distribution where 90% of the data falls in our center range and 10% outside, so 5% in each tail.

Code
# Calculate the z-score for a 90% confidence interval
desiredConfidence <- 90
tails = (100-desiredConfidence )/2
percentileOfinterest <- desiredConfidence+tails
z_score <- qnorm(percentileOfinterest/100)
z_score
[1] 1.644854

For a 99% confidence level:

Code
# Calculate the z-score for a 99% confidence interval
desiredConfidence <- 99
tails = (100-desiredConfidence )/2
percentileOfinterest <- desiredConfidence+tails
z_score <- qnorm(percentileOfinterest/100)
z_score
[1] 2.575829

Confidence Intervals for Coin Flips

Example 1: Flipping a Coin 100 Times

In this example of flipping a coin 100 times, we observed 44 heads, resulting in the following 95% confidence interval for \(p\):

\[ CI= \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

where:

  • \(\hat{p}\) is the observed proportion of heads

  • \(z\) is the critical value from the standard normal distribution for a 95% confidence level (approximately 1.96)

  • \(n\) is the number of trials (100 in this case)

Code
n1 <- 100
x1 <- 44
p_hat1 <- x1 / n1
z <- 1.96

# Confidence interval calculation
(ci_lower1 <- p_hat1 - z * sqrt(p_hat1 * (1 - p_hat1) / n1))
[1] 0.3427082
Code
(ci_upper1 <- p_hat1 + z * sqrt(p_hat1 * (1 - p_hat1) / n1))
[1] 0.5372918

p: (0.343, 0.537)

From this, we concluded that it is plausible that the coin may be fair because \(p = 0.5\) is in the interval.

Example 2: Flipping a Coin 100,000 Times

Suppose instead that we flipped the coin 100,000 times, observing 44,000 heads (the same percentage of heads as before). Then using the method just presented, the 95% confidence interval for \(p\) is:

Code
# Example 2: Flipping a Coin 100,000 Times
n2 <- 100000
x2 <- 44000
p_hat2 <- x2 / n2

# Confidence interval calculation
(ci_lower2 <- p_hat2 - z * sqrt(p_hat2 * (1 - p_hat2) / n2))
[1] 0.4369234
Code
(ci_upper2 <- p_hat2 + z * sqrt(p_hat2 * (1 - p_hat2) / n2))
[1] 0.4430766

p: (0.437, 0.443)

Is it reasonable to conclude that this is a fair coin with 95% confidence? Based on this narrower confidence interval, it appears less likely that the coin is fair since \(p = 0.5\) is not within the interval.

Estimating the SE with bootstrap principle

We still have a problem with calculating the confidence interval this way: we need to know the standard deviation of the population \(\sigma\), and we usually don’t. But the bootstrap principle states that we can replace \(\sigma\) by its sample version \(s\) and still get an approximately correct confidence interval.

Example: We poll 1000 likely voters and find that 58% approve of the way the president handles his job.

\(SE= \frac{\sigma}{\sqrt{n}} \times 100\) (in percentage points), where \(\sigma = \sqrt{p(1-p)}\) and \(p\) is the proportion of all voters who approve. We don't know \(p\), but the bootstrap principle tells us we can replace \(\sigma\) by \(s\), so we can plug in the value from our survey: \(s=\sqrt{0.58(1-0.58)} \approx 0.49\)

So a 95% confidence interval for \(p\) is: \[ 58\% \mp 1.96 \frac{0.49}{\sqrt{1000}}*100 \]

which is approximately [54.9%,61.1%]
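The same calculation can be checked in R (using the poll values from above):

```r
# Values from the poll above: n = 1000, 58% approve
p_hat <- 0.58
n <- 1000

# Bootstrap principle: plug the sample proportion into the SE formula
s <- sqrt(p_hat * (1 - p_hat))     # about 0.49
se_pct <- s / sqrt(n) * 100        # SE in percentage points

# 95% confidence interval for the approval percentage
ci <- 58 + c(-1, 1) * 1.96 * se_pct
round(ci, 1)                       # [1] 54.9 61.1
```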

The width of the confidence interval is determined by \(z\) and the standard error SE. To reduce the margin of error we have two options: increase the sample size or decrease the confidence level. Because the sample size sits under a square root, cutting the width of the confidence interval in half requires four times the sample size, and reducing it by a factor of 10 requires 100 times the sample size.
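This square-root behavior is easy to verify numerically:

```r
# The margin of error shrinks with the square root of n
p_hat <- 0.58
moe <- function(n) 1.96 * sqrt(p_hat * (1 - p_hat) / n)

moe(4000) / moe(1000)     # 4x the sample  -> half the margin (0.5)
moe(100000) / moe(1000)   # 100x the sample -> one tenth (0.1)
```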

5.2 Getting the confidence intervals from a test result

Code
#simulate our survey results
success<- rep(1,580)
failure<- rep(0,420)
mydata <- c(success,failure)
#run a t.test
testResult <- t.test(mydata)
confInt95_low <- testResult$conf.int[1]
confInt95_upp <- testResult$conf.int[2]
confInt95_upp
[1] 0.6106429
Code
cat('confidence interval = [',confInt95_low,confInt95_upp,']')
confidence interval = [ 0.5493571 0.6106429 ]
Code
margin_of_error <- (confInt95_upp - confInt95_low) / 2
cat('margin of error: ',margin_of_error)
margin of error:  0.03064294

For example, in our optical dataset, if we run a t.test over the eye_difference variable we can get the confidence interval for a 95% confidence level:

Code
file1 <- here::here("data","optical_sample.xlsx")
optical_sample <- read_excel(file1)

testResult <- t.test(optical_sample$eye_difference)

confInt95_low <- testResult$conf.int[1]
confInt95_upp <- testResult$conf.int[2]

margin_of_error <- (confInt95_upp - confInt95_low) / 2 

ggplot(optical_sample, aes(x = eye_difference)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(optical_sample$eye_difference),
             linetype = "dashed")+
  geom_vline(xintercept = confInt95_low,
           linetype = "dotted", color = "green")+
  geom_vline(xintercept = confInt95_upp,
           linetype = "dotted", color = "green")

the margin of error will be:

Code
margin_of_error <- (confInt95_upp - confInt95_low) / 2 
margin_of_error
[1] 0.3655392

Now we can calculate the confidence intervals for the 90% and 99% confidence levels and see their differences in a graph:

Code
#level of confidence 90%. 
testResult <- t.test(optical_sample$eye_difference,
       conf.level = .90)
confInt90_low <- testResult$conf.int[1]
confInt90_upp <- testResult$conf.int[2]

#level of confidence 99%. 
testResult <- t.test(optical_sample$eye_difference,
                   conf.level = .99)
confInt99_low <- testResult$conf.int[1]
confInt99_upp <- testResult$conf.int[2]

ggplot(optical_sample, aes(x = eye_difference)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(optical_sample$eye_difference),
             linetype = "dashed")+
  geom_vline(xintercept = confInt90_low,
             linetype = "dotted", color = "orange")+
  geom_vline(xintercept = confInt90_upp,
             linetype = "dotted", color = "orange")+
  geom_vline(xintercept = confInt95_low,
           linetype = "dotted", color = "green")+
  geom_vline(xintercept = confInt95_upp,
           linetype = "dotted", color = "green")+
  geom_vline(xintercept = confInt99_low,
             linetype = "dotted", color = "yellow")+
  geom_vline(xintercept = confInt99_upp,
             linetype = "dotted", color = "yellow")

Three factors feed into the size of the margin of error:

  1. The confidence level of the interval. That is, the proportion of the time our interval would correctly capture the parameter of interest. Higher confidence requires a larger margin of error.

  2. The spread of the observations in the data set. More spread in the data implies more sampling variability and therefore a larger margin of error.

  3. The size of the sample. Larger samples yield smaller standard errors and therefore smaller margins of error.

Exercises:

Using the pm_sample data set,

  1. Find a 95% confidence interval for the mean salary of the project managers in this population. Interpret the results in plain language. Would a 99% confidence interval be wider or narrower?

  2. Find a 95% confidence interval for the mean non-salary compensation of project managers in this population. Why is the margin of error different from that in the first problem?

Code
file <- here::here("data", "pm_survey.xlsx") 
pm_survey <- read_excel(file) 

kable(head(pm_survey))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
timestamp how_old_are_you industry job_title additional_context_on_job_title annual_salary other_monetary_comp currency currency_other additional_context_on_income country state city overall_years_of_professional_experience years_of_experience_in_field highest_level_of_education_completed gender race
2021-04-27 11:04:04 25-34 Nonprofits Project Manager NA 66625 1500 USD NA NA US Massachusetts Boston 8 - 10 years 8 - 10 years College degree Woman Hispanic, Latino, or Spanish origin, White
2021-04-27 11:09:34 35-44 Property or Construction Project Manager NA 52000 6000 USD NA NA United States South Carolina Charleston 11 - 20 years 5-7 years High School Woman White
2021-04-27 11:11:41 35-44 Engineering or Manufacturing Project Manager NA 65000 1000 USD NA NA United States Pennsylvania East Stroudsburg 8 - 10 years 2 - 4 years Master's degree Man White
2021-04-27 11:12:58 45-54 Property or Construction Project Manager NA 52800 3600 USD NA NA USA Arizona Phoenix 21 - 30 years 11 - 20 years High School Woman White
2021-04-27 11:13:28 25-34 Nonprofits Project Manager Environmental Health 65818 1000 USD NA NA United States Illinois Chicago 5-7 years 5-7 years College degree Woman White
2021-04-27 11:14:04 25-34 Manufacturing Project Manager NA 66250 6000 USD NA NA United States Tennessee Nashville 8 - 10 years 5-7 years Master's degree Man White
Code
as.data.frame(report(pm_survey$annual_salary))
Mean     |       SD |   Median |      MAD |      Min |      Max | n_Obs | Skewness | Kurtosis | percentage_Missing
------------------------------------------------------------------------------------------------------------------
87653.22 | 32751.30 | 82000.00 | 24907.68 | 28000.00 | 2.80e+05 |   221 |     2.02 |     7.01 |               0.00
Code
t.test(pm_survey$annual_salary) 

    One Sample t-test

data:  pm_survey$annual_salary
t = 39.786, df = 220, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 83311.35 91995.08
sample estimates:
mean of x 
 87653.22 

The confidence interval at the (default) 95% level is (83311.35, 91995.08), which means in plain language that if we were to repeat the sampling and the test many times, intervals built this way would capture the true value of the parameter 95% of the time. Although not strictly true, we often loosely say that there's a 95% chance that this confidence interval includes the true population mean.

For the second exercise

Code
pm_survey <- pm_survey |>    
  mutate(other_monetary_comp = replace_na(other_monetary_comp, 0))  

as.data.frame(report(pm_survey$other_monetary_comp))
Mean    |      SD |  Median |     MAD |  Min |      Max | n_Obs | Skewness | Kurtosis | percentage_Missing
----------------------------------------------------------------------------------------------------------
5353.81 | 9676.19 | 1500.00 | 2223.90 | 0.00 | 70000.00 |   221 |     3.32 |    13.78 |               0.00
Code
t.test(pm_survey$other_monetary_comp)  

    One Sample t-test

data:  pm_survey$other_monetary_comp
t = 8.2253, df = 220, p-value = 0.00000000000001693
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 4071.026 6636.585
sample estimates:
mean of x 
 5353.805 

our confidence interval is (4071.03, 6636.59), and so the margins of error in the two examples are:

Code
monetary_marginError<- (t.test(pm_survey$annual_salary)$conf.int[2] - 
                          t.test(pm_survey$annual_salary)$conf.int[1]) / 2 
monetary_marginError
[1] 4341.863
Code
nonMonetary_marginError<- (t.test(pm_survey$other_monetary_comp)$conf.int[2] - 
                             t.test(pm_survey$other_monetary_comp)$conf.int[1]) / 2
nonMonetary_marginError
[1] 1282.78

The difference between the two is due to the variability of the data, which we can quantify using the standard deviation:

Code
sd(pm_survey$annual_salary) 
[1] 32751.3
Code
sd(pm_survey$other_monetary_comp)
[1] 9676.192

5.2.1 Interpreting confidence intervals: practical example

We are going to calculate the confidence interval for 100 samples of 30 observations each, drawn from the same population, and plot them on a graph with a line marking the true mean. We will see how some of them (for a 95% confidence level, around 5%) still miss the population parameter:

Code
#calculate the sample mean and confidence interval from a sample of 100 
low = numeric()
high = numeric()
for (n in 1:100){
  sample <- slice_sample(optical, n = 30)
  test <- t.test(sample$eye_difference)
  low[n] <- test$conf.int[1]
  high[n] <- test$conf.int[2]
}

ci_reps <- data.frame("replicate" = 1:100,
                      low,
                      high)
ggplot(ci_reps, aes(x = low, 
                    xend = high,
                    y = replicate, 
                    yend = replicate)) + 
  geom_segment() +
  geom_vline(xintercept = mean(optical$eye_difference), 
             linetype = "dashed")

When interpreting a confidence interval, remember that not all values in an interval are equal. The center is always more likely than the ends.

6 Significance testing

A t-significance test is used to consider whether an individual parameter value of interest is plausible in light of sample data.

The situation in which the parameter has that value is referred to as the null hypothesis, and the situation in which it does not is referred to as the alternative hypothesis. For example: could the difference between the left eye value and the right eye value actually be zero? Does a certain large company pay better than the national average, or could salaries there just differ through random chance?

A test statistic measures how far away the data in our sample are from what we would expect if the null hypothesis \(H_0\) were true.

The most common test statistic is the z-statistic. It measures how far an observed value is from the expected value, in units of standard errors. We already saw this formula before:

\[ z= \frac{observed - expected}{SE} \tag{8}\]

Observed is a statistic that is appropriate for assessing \(H_0\). For example, if we toss a coin ten times and want to know if it is a fair coin, the appropriate statistic would be the number of tails or the percentage of tails.

Expected and SE are the expected value and the SE of this statistic computed under the assumption that \(H_0\) is true.

Example:

We toss a coin 10 times and get 7 tails. Using the formulas for the expected value of a sum, the expected number of tails if the coin is fair is:

For a binomial distribution we have \(E=n p\), where \(n\) is the number of repetitions and \(p\) is the probability of success (0.5), and the standard error for binomial scenarios is \(SE = \sqrt{np(1-p)}\).

\[ expected = 10 * \frac{1}{2} = 5 \]

\[ SE = \sqrt{10}\sqrt{\frac{1}{2} * \frac{1}{2}} =1.58 \]

\[ z= \frac{7-5}{1.58}=1.27 \]

By the central limit theorem, the \(p\)-value can be computed with normal approximation.

Code
z <- 1.27

# Calculate the cumulative probability up to z
(p_value <- (1 - pnorm(z))*2)
[1] 0.2040846

In other words, assuming the coin is fair, there is a 20.4% probability of observing a result at least as extreme as 7 tails in 10 tosses.

If the \(z\)-value is large, that means that there is a great difference between the observed and the expected, so large values of \(z\) are evidence against \(H_0\). The strength of the evidence is measured by the \(p\)-value or observed significance level.

The \(p\)-value is the probability of getting a value of \(z\) as extreme or more extreme than the observed, assuming the null hypothesis is true. As we can see in the graph below, as \(z\) becomes larger, there is less probability of finding data outside of its limits, so the \(p\)-value will be smaller. We multiply the one-tailed probability by 2 because both tails of the distribution count as extreme for this experiment.

The \(z\)-score tells you how far an observed value is from the expected value in terms of standard errors. The \(p\)-value tells you the probability of observing a test statistic as extreme as, or more extreme than, the observed statistic, under the null hypothesis.

The result of a significance test is a \(p\)-value, which measures how plausible the value of interest is given the sample data. A \(p\)-value close to zero indicates the value isn’t compatible with the data. As a general rule, if p<0.05, the value can be considered questionable.

Ideally, you should identify a value of interest (the technical term is null hypothesis) before collecting the data. While this isn't always done in practice, doing so helps prevent wrong conclusions that come from the shotgun effect.

If you run a test because you observed an interesting pattern in your sample data, the overall chance of encountering a significant result increases. Data is like firing a shotgun with many pellets: some are bound to hit the target purely by randomness. If you build your null hypothesis around a characteristic you observed in your sample, you are likely to find significance even if that characteristic is not present in the overall population and appeared just by chance.

While a \(p\)-value might tell you that a parameter is different from a hypothesized value, it won’t ever say how different it might be or if that difference is important.

A t-significance test is appropriate for quantitative data under the exact same circumstances as a t-confidence interval: unless the sample is very small and the data is highly non-symmetric. If the data is relatively symmetric and there are no extreme outliers, n=10 is usually enough. Regardless of the shape of the data, n=30 is a safe threshold.
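A small simulation (not from the text above, just an illustrative sketch) shows why those sample-size guidelines matter: for a skewed population, a nominal 95% t-interval captures the true mean somewhat less often at n = 10 than at n = 30.

```r
# Sketch: how often a nominal 95% t-interval captures the true mean of a
# skewed population, for small vs moderate sample sizes
set.seed(1)
true_mean <- 1     # mean of an Exponential(1) population, which is skewed
coverage <- function(n, reps = 5000) {
  mean(replicate(reps, {
    ci <- t.test(rexp(n, rate = 1))$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  }))
}
c10 <- coverage(10)   # tends to fall a bit short of 0.95
c30 <- coverage(30)   # noticeably closer to 0.95
c(c10, c30)
```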

The value of \(\mu\) (mu) is the value we want to test against (our null hypothesis)

Code
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)

testResult <- t.test(optical_sample$eye_difference, mu = 0) 

testResult$p.value
[1] 0.005840652

The \(p\)-value expresses the probability of observing values like those in our sample if the eye difference in the population were in fact 0. The smaller the \(p\)-value, the smaller the probability. Usually we take a threshold of 5% (\(p\)-value < 0.05) to reject the null hypothesis, as it means that it is unlikely that the mean observed in our sample comes from a population whose true mean is 0.

We can change the null hypothesis to whatever value we want to test against. For example, can we say that the average age of the population is NOT 59?

Code
testResult <- t.test(optical_sample$Age, mu = 59)
testResult$p.value 
[1] 0.1078985

Our \(p\)-value is 0.108, which does not allow us to reject the null hypothesis.

Quick Facts

  1. A \(p\)-value below 0.05 doesn’t mean there’s a 5% chance the null hypothesis is true. Instead, it means if the null were true, there’s a 5% chance of observing data as extreme as, or more extreme than the one that was observed in the sample.

  2. Statistical power, the ability to detect a true effect, is often overlooked. A study can have a high chance of missing real effects (Type II error) even if its \(p\)-value threshold is stringent.

  3. In proportional analysis, Simpson’s Paradox can occur where a trend seen in several groups reverses when the groups are combined. It underscores the importance of scrutinizing aggregated data.

6.0.1 One sided vs two sided test.

One has to carefully consider whether the alternative should be one-sided or two-sided: whether we count the values to the left of \(-z\) and to the right of \(z\), or only the values on one side. In a two-sided test, the \(p\)-value gets doubled. It is not acceptable to change the alternative afterwards in order to get the \(p\)-value under 5%.

Deciding whether to use a one-sided or two-sided t-test depends on the hypothesis you want to test. Here are some guidelines to help you make that decision:

Use a one-sided t-test when you have a specific direction in mind for your hypothesis. This means you are testing whether the mean is either greater than or less than a certain value, but not both.

Examples:

  • Greater than: You want to test if the mean score of a new teaching method is greater than the mean score of the traditional method.

  • Less than: You want to test if the mean time to complete a task using a new software is less than the mean time using the old software.

Use a two-sided t-test when you are interested in any difference from the specified value, regardless of direction. This means you are testing whether the mean is different from a certain value, without specifying the direction of the difference.

Examples:

  • You want to test if the mean weight of a sample of apples is different from a known standard weight.

  • You want to test if the mean score of a new drug is different from the mean score of a placebo.

How to Decide

  1. Define Your Research Question: Clearly state what you are trying to find out.

  2. Determine the Direction of Interest: Decide if you are only interested in deviations in one direction (one-sided) or in both directions (two-sided).

  3. Formulate Your Hypotheses: Based on your research question and direction of interest, formulate your null and alternative hypotheses.

Example Scenario

Suppose you are testing a new drug and want to know if it has a different effect on blood pressure compared to a placebo. If you only care whether the drug lowers blood pressure, you would use a one-sided test. If you care whether the drug either lowers or raises blood pressure, you would use a two-sided test.

Code
# Sample data
sample_data <- c(78, 82, 85, 90, 76, 79, 81, 77, 74, 88)

# Perform one-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "greater")

# Print the result
print(t_test_result)

    One Sample t-test

data:  sample_data
t = 3.6, df = 9, p-value = 0.002874
alternative hypothesis: true mean is greater than 75
95 percent confidence interval:
 77.94481      Inf
sample estimates:
mean of x 
       81 
Code
# Perform two-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "two.sided")

# Print the result
print(t_test_result)

    One Sample t-test

data:  sample_data
t = 3.6, df = 9, p-value = 0.005748
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
 77.22974 84.77026
sample estimates:
mean of x 
       81 
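The two outputs above illustrate the doubling exactly: since the sample mean lies above 75, the two-sided \(p\)-value is twice the one-sided one.

```r
# Same data as above: the two-sided p-value is exactly 2x the one-sided
sample_data <- c(78, 82, 85, 90, 76, 79, 81, 77, 74, 88)
p_one <- t.test(sample_data, mu = 75, alternative = "greater")$p.value
p_two <- t.test(sample_data, mu = 75, alternative = "two.sided")$p.value
p_two / p_one   # [1] 2
```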

6.0.2 Common errors in significance testing

At the start of a significance test, the hypothesized value might be true or false. At the end, it may be rejected or not, but we still don't know for sure whether \(H_0\) is true. If the null hypothesis holds we call it a true null hypothesis; otherwise it is a false null hypothesis. Altogether, there are four possible combinations of these outcomes:

                          \(H_0\) is true                  \(H_0\) is false
\(H_0\) is rejected       Type I error (false positive)    True positive
\(H_0\) is not rejected   True negative                    Type II error (false negative)

We call the event of rejecting the null hypothesis when it is in fact true a Type I error, and we call the probability of making a Type I error the Type I error rate. Rejecting the null hypothesis when the \(p\)-value is less than \(\alpha\) controls the Type I error rate so that it equals \(\alpha\).
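We can check this by simulation: if the null hypothesis really is true, rejecting at p < 0.05 should produce false positives about 5% of the time.

```r
# Simulate many tests where H0 is true (the samples really have mean 0)
set.seed(42)
pvals <- replicate(10000, t.test(rnorm(20), mu = 0)$p.value)
mean(pvals < 0.05)   # close to the nominal alpha = 0.05
```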

We may observe a \(p\)-value higher than alpha even when the null hypothesis is false. This is a Type II error. We ask ourselves this question: How big does \(N\) have to be in order to detect that the absolute value of the difference is greater than zero? Type II error control plays a major role in designing data collection procedures before you actually see the data, so that you know the test you will run has enough sensitivity or power. Power is 1 minus Type II error rate, or the probability that you will reject the null hypothesis when the alternative hypothesis is true.

There are several aspects of a hypothesis test that affect its power for a particular effect size. Intuitively, setting a lower \(\alpha\) decreases the power of the test for a given effect size because the null hypothesis will be more difficult to reject (for example from 0.05 to 0.01). This means that for an experiment with fixed parameters (i.e., with a predetermined sample size, recording mechanism, etc), the power of the hypothesis test trades off with its Type I error rate, no matter what effect size you target.

Testing a Hypothesis

Conducting a hypothesis test typically proceeds in four steps. First, we define the null and alternative hypotheses. Next, we construct a test statistic that summarizes the strength of evidence against the null hypothesis. We then compute a \(p\)-value that quantifies the probability of having obtained a comparable or more extreme value of the test statistic under the null hypothesis. Finally, based on the \(p\)-value, we decide whether to reject the null hypothesis.

Step 1: Define the Null and Alternative Hypotheses.

The null hypothesis \(H_0\) is the default state of belief about the world. The alternative hypothesis \(H_a\) represents something different. We can think of rejecting the null hypothesis as making a discovery about our data, but if we fail to reject it, we do not know whether that is because our sample was too small or because the null hypothesis holds.

Step 2. Construct the Test Statistic

The way in which we construct our test statistic T depends on the nature of the null hypothesis that we are testing.

Step 3. Compute the \(p\)-value

The \(p\)-value is the area under the density function of the null distribution that falls to the right of the \(T\) statistic calculated in the previous step; it gives us the probability of observing a value as large as or larger than the \(T\)-statistic under the assumption that the null hypothesis holds. In general, most commonly used test statistics follow a well-known distribution under the null hypothesis, such as a normal, \(t\), \(\chi^2\), or \(F\) distribution, provided that the sample size is sufficiently large and that some other assumptions hold.

The \(p\)-value allows us to transform our test statistic, which is measured on some arbitrary and uninterpretable scale, into a number between 0 and 1 that can be easily interpreted.

Step 4. Decide whether to reject the null hypothesis.

Once we have computed the \(p\)-value, it remains for us to decide whether or not to reject the null hypothesis. If the \(p\)-value is sufficiently small, we reject it, but deciding what counts as sufficiently small is up to us.

6.0.3 Statistical power

Statistical power is the ability of a test to detect a specified effect. A test with low power is unlikely to rule out a hypothesized value, even if the value is false. That is, it has a high probability of type II error. Increasing the power of an inference technique requires a trade-off of one sort or another. In practice, this usually means using a larger sample. A preliminary power analysis can give decision-makers information about the study size needed to detect an effect of interest.

A very simple example of how to do that: study planners can estimate the sample size needed to reduce the margin of error in the estimate of a population proportion using this formula: \[ n = \left( \frac{1.96}{2E} \right)^2 \]

where \(E\) is the required margin of error (the formula takes the worst case \(p = 0.5\), which maximizes the standard error). According to this formula, a simple political poll requiring a 3% margin of error would need a sample size of approximately n=1067 respondents.
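The formula is a one-liner in R:

```r
# Sample size needed for a given margin of error E (worst case p = 0.5)
required_n <- function(E) (1.96 / (2 * E))^2

round(required_n(0.03))    # [1] 1067  (the 3% political poll)
round(required_n(0.015))   # halving E roughly quadruples the sample
```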

We are going to use our mice dataset to see how we can calculate the power of our t-test against Type II errors. We consider the data in the file to be our entire population. If we look at the difference in average weight for control vs treatment, we see that there is in fact a difference of around 10%:

Code
dat <- read.csv("data/mice_pheno.csv") 
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  dplyr::select(Bodyweight) %>% unlist

hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  dplyr::select(Bodyweight) %>% unlist

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
[1] 2.375517
Code
print((mu_hf - mu_control)/mu_control * 100) #percent increase
[1] 9.942157

Depending on our sample size, this difference will be perceived by our test or not. Let’s see an example with sample size = 12 and alpha = 0.05. We will do first a single test and see that we cannot reject the null hypothesis based on the \(p\)-value:

Code
set.seed(1)
N <- 12
hf <- sample(hfPopulation,N)
control <- sample(controlPopulation,N)
t.test(hf,control)$p.value
[1] 0.2506346

If we run this experiment multiple times, we can get the percentage of times that our test actually rejects the null hypothesis:

Code
repetitions<-2000
N<- 12
alpha<- 0.05
reject<- function(N, alpha=0.05){
  hf <- sample(hfPopulation, N)
  control <- sample(controlPopulation, N)
  pval<- t.test(hf,control)$p.value
  pval < alpha
}

rejections<- replicate(repetitions,reject(N))
mean(rejections)
[1] 0.214

In this case we see that only about 21% of the time did we correctly reject the null hypothesis; this is the power of our test. To increase this percentage we should increase the sample size. In the next code section we run the same simulation for various values of N:

Code
Ns<- seq(5,50,5)
power <- sapply(Ns,function(N){
  rejections<- replicate(repetitions,reject(N))
  mean(rejections)
})
plot(Ns,power, type="b")

There are several tools on the internet that let you calculate the sample size you need for a particular standard deviation, power, and effect size you want to detect.

When performing a hypothesis test over a large number of data points, the sample size can significantly impact the \(p\)-value.

Influence of Sample Size on the p-value:

Increased Sensitivity: A larger sample size increases the power of the test, meaning it’s more likely to detect a true effect if one exists. This is because with more data, we get a clearer picture of the population, reducing the standard error.

Reduced Standard Error: The standard error of the mean decreases as the sample size increases. This makes the test statistic (such as the \(t\)-score or \(z\)-score) larger if there is an effect, which can lead to a smaller \(p\)-value.

Smaller p-values: With a large sample size, even small differences from the null hypothesis can result in statistically significant \(p\)-values. This is because the test becomes more sensitive to detecting small effects.

Risk of Overinterpretation: While small \(p\)-values indicate statistical significance, they do not necessarily imply practical significance. In large datasets, it’s crucial to consider the effect size and practical relevance of the results.

Example: Consider a hypothesis test to determine whether a new drug has a different effect than a placebo. If you have a very large sample size, even a tiny difference in outcomes between the drug and placebo groups may produce a statistically significant \(p\)-value, even if the difference is not clinically meaningful.

Visual Explanation: Imagine plotting the p-value as a function of sample size. Initially, as the sample size increases, the p-value decreases sharply, indicating stronger evidence against the null hypothesis. However, beyond a certain point, the \(p\)-value becomes very small for even minute differences, highlighting the need to interpret results in context.

Large Sample Size: Leads to smaller \(p\)-values, higher test sensitivity, and potential for detecting trivial differences as significant.

Interpret Results Carefully: Always consider effect size and practical significance, not just statistical significance.

Code
# Set seed for reproducibility
set.seed(123)

# Generate synthetic data
small_sample <- rnorm(30, mean = 0.03, sd = 1)
large_sample <- rnorm(2000, mean = 0.03, sd = 1)

# Function to calculate p-values for increasing sample sizes
calculate_p_values <- function(data, max_size) {
  p_values <- vector("numeric", max_size)
  for (i in 2:max_size) {  # Start from 2 to avoid t-test error
    sample_data <- data[1:i]
    test_result <- t.test(sample_data, mu = 0)
    p_values[i] <- test_result$p.value
  }
  return(p_values)
}

# Calculate p-values for both small and large samples
small_sample_p_values <- calculate_p_values(small_sample, length(small_sample))
large_sample_p_values <- calculate_p_values(large_sample, length(large_sample))

# Combine results into a data frame
p_values_df <- data.frame(
  Sample_Size = c(1:length(small_sample), 1:length(large_sample)),
  P_Value = c(small_sample_p_values, large_sample_p_values),
  Sample_Type = rep(c("Small_Sample", "Large_Sample"), c(length(small_sample), length(large_sample)))
)

# Filter out NA values
p_values_df <- p_values_df %>% filter(!is.na(P_Value))

# Plot the p-values with facets
ggplot(p_values_df, aes(x = Sample_Size, y = P_Value, color = Sample_Type)) +
  geom_line() +
  labs(title = "Influence of Sample Size on P-Value",
       x = "Sample Size",
       y = "P-Value",
       color = "Sample Type") +
  theme_minimal() +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  scale_color_manual(values = c("Small_Sample" = "blue", "Large_Sample" = "green")) +
facet_wrap(~ Sample_Type, scales = "free_x", ncol = 1)+
  ylim(0, 1)

The graphs above show how even a very tiny deviation (our data's \(\mu =0.03\)) from the null hypothesis (\(\mu =0\)) shows up as significant once the sample grows beyond roughly 1500 data points, while the test run over a maximum of only 30 data points never detects the same difference as significant.

6.0.4 Student’s t test

So far we have been using the t-test without explaining the student’s t-distribution that is used under the hood. Let’s explain it now.

We will explain the Student's t-test with an example: the health guideline for lead in drinking water is a concentration of not more than 15 parts per billion (ppb). Five independent samples from a reservoir average 15.6 ppb. Is this sufficient evidence to conclude that the concentration \(\mu\) in the reservoir is above the standard of 15 ppb?

Our null hypothesis is that the concentration is right at the standard, \(H_0:\mu=15\), and the alternative hypothesis is that the concentration is higher than 15 ppb.

\[ z= \frac{observed - expected}{SE} = \frac{15.6-15}{SE} \]

The SE of the average is \(\frac{\sigma}{\sqrt{n}}\), but \(\sigma\) is unknown. By the bootstrap principle we should be able to substitute the population standard deviation \(\sigma\) with the standard deviation of our sample \(s\); however, for sample sizes of 20 or less, the normal curve is not a good enough approximation to the distribution of the \(z\)-statistic. The appropriate approximation is instead the Student's t-distribution with n-1 degrees of freedom.

In the graph, the purple line, where the degrees of freedom are high (large sample), is just the normal curve. The other lines show how with fewer degrees of freedom the tails of the curve get fatter, representing higher uncertainty. Recall the sample standard deviation formula:
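A graph like the one described can be reproduced with a sketch along these lines (the colors and choice of degrees of freedom are illustrative):

```r
# Student's t densities vs the normal curve: fatter tails for small df
x <- seq(-4, 4, length.out = 400)
plot(x, dnorm(x), type = "l", col = "purple", lwd = 2,
     ylab = "density", main = "Student's t vs the normal curve")
for (df in c(2, 5, 10)) {
  lines(x, dt(x, df = df), lty = 2)   # heavier tails as df decreases
}
```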

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2} \]

In this case we also need to adjust our confidence interval formula from: \(CI=estimate\mp z\ SE\)

to \[ CI= estimate (\mu) \mp t_{n-1}SE \tag{9}\]
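A minimal sketch in R, with five hypothetical measurements that average 15.6 ppb (the individual values are invented for illustration; only the mean and the 15 ppb limit come from the example):

```r
# Hypothetical lead measurements (ppb); their mean is 15.6 as in the example
lead <- c(15.2, 15.5, 15.9, 15.4, 16.0)

# One-sided one-sample t-test of H0: mu = 15 against H1: mu > 15,
# which uses the t-distribution with n - 1 = 4 degrees of freedom
t.test(lead, mu = 15, alternative = "greater")

# The same statistic by hand: t = (xbar - mu0) / (s / sqrt(n))
(mean(lead) - 15) / (sd(lead) / sqrt(length(lead)))
```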

z-test vs t-test

The main differences between a t-test and a \(z\)-test lie in their assumptions and applications:

T-Test

  • Distribution: Uses the Student’s t-distribution.

  • Population Variance: Unknown and estimated from the sample.

  • Sample Size: Typically used for small sample sizes (n < 30).

  • Degrees of Freedom: Required for the calculation.

  • Application: Used when comparing the means of two groups, especially when the sample size is small and the population variance is unknown.

Z-Test

  • Distribution: Uses the standard normal distribution (z-distribution).

  • Population Variance: Known.

  • Sample Size: Typically used for large sample sizes (n > 30).

  • Degrees of Freedom: Not required.

  • Application: Used for hypothesis testing of means and proportions when the sample size is large and the population variance is known.

Key Differences

  1. Distribution:

    • T-test: Student’s t-distribution.

    • \(z\)-test: Standard normal distribution.

  2. Population Variance:

    • T-test: Unknown and estimated from the sample.

    • \(z\)-test: Known.

  3. Sample Size:

    • T-test: Small sample sizes (n < 30).

    • \(z\)-test: Large sample sizes (n > 30).

  4. Degrees of Freedom:

    • T-test: Required.

    • \(z\)-test: Not required.

Example Scenarios

  • T-Test: Comparing the average test scores of two small classes where the population variance is unknown.

  • Z-Test: Testing the average height of a large population where the population variance is known.

T-tests are the most commonly used tests in practice, but if we meet all the criteria for a \(z\)-test we can run it in R like this:

Code
library(BSDA)
# Sample data
sample_data <- c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99, 105, 109, 109, 109, 110, 112, 112, 113, 114, 115)
population_mean <- 100
population_sd <- 15

# Perform a one-sample z-test
z_test_result <- z.test(sample_data, mu = population_mean, sigma.x = population_sd)
print(z_test_result)

    One-sample z-Test

data:  sample_data
z = 0.90933, p-value = 0.3632
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
  96.47608 109.62392
sample estimates:
mean of x 
   103.05 

7 Categorical data

In 1912 the Titanic sank and more than 1500 of the 2229 people on board died. Did the chances of survival depend on the ticket class?

           First   Second   Third   Crew
Survived     202      118     178    215
Died         123      167     528    698

This is an example of categorical data: the data are counts for a fixed number of categories. Here the data are tabulated in a 2x4 table. Such a table is called a contingency table because it shows the survival counts for each category of ticket class, i.e. contingent on ticket class.

7.1 Confidence Intervals for Proportions.

Confidence intervals for proportions provide a range of plausible values for population proportions, enabling estimation with a desired level of confidence.

Proportional analysis techniques allow for the assessment and comparison of proportions across different categories, providing valuable insights into categorical data.

There are many statistical methods for computing confidence intervals for proportions. By far the simplest is:

\[ p = \hat{p} \pm \sqrt{\frac{\hat{p} \cdot (1 - \hat{p})}{n}} \tag{10}\]

where \(p\) is the actual population proportion and \(\hat{p}\) is the observed proportion.

As long as the sample size isn't too small, the difference between methods should be minor. If your sample includes at least 5 observations of each possible outcome, you are fine with whatever default method your software uses to calculate the proportions.

In R we will use \(prop.test(n_s , sample size)\) where \(n_s\) is the number of successes.

Practice:

We are going to calculate the proportion of smokers in our population based on our optical_sample data

Code
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)

as.data.frame(report(optical_sample$IsSmoker))
Mean |   SD | Median |  MAD |  Min |  Max | n_Obs | Skewness | Kurtosis | percentage_Missing
--------------------------------------------------------------------------------------------
0.48 | 0.50 |   0.00 | 0.00 | 0.00 | 1.00 |   100 |     0.08 |    -2.03 |               0.00
Code
successNumber<- sum(optical_sample$IsSmoker)
sampleSize <- NROW(optical_sample$IsSmoker)
testResult <- prop.test(successNumber, sampleSize)
testResult

    1-sample proportions test with continuity correction

data:  successNumber out of sampleSize, null probability 0.5
X-squared = 0.09, df = 1, p-value = 0.7642
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3798722 0.5816817
sample estimates:
   p 
0.48 
Code
#proportion of success 
proportion<- testResult$estimate %>% print()
   p 
0.48 
Code
conf_int <- testResult$conf.int%>% print()
[1] 0.3798722 0.5816817
attr(,"conf.level")
[1] 0.95

We can also use the formula we studied at the beginning of the chapter to calculate the confidence interval manually and compare it with what the software produced:

Code
ci_lower<- proportion - sqrt(proportion*(1-proportion)/sampleSize) 
ci_upper<- proportion + sqrt(proportion*(1-proportion)/sampleSize) 
ci_lower
      p 
0.43004 
Code
ci_upper
      p 
0.52996 

Another way of manually calculating the confidence interval is using a \(z\)-score.

In the example below we use 95%: qnorm(0.975) is the \(z\)-score corresponding to a 95% confidence level.

Code
margin_of_error <- qnorm(0.975) * sqrt(proportion * (1 - proportion) / sampleSize)

# Calculate the confidence interval bounds
proportion - margin_of_error
        p 
0.3820802 
Code
proportion + margin_of_error
        p 
0.5779198 

For different confidence levels you will need to find the correct \(z\)-score
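For a two-sided interval at confidence level \(1-\alpha\), the required \(z\)-score is the quantile qnorm(1 - \(\alpha\)/2); a quick sketch:

```r
# z-scores for some common two-sided confidence levels
conf_levels <- c(0.90, 0.95, 0.99)
z_scores <- qnorm(1 - (1 - conf_levels) / 2)
z_scores  # approximately 1.64, 1.96, 2.58
```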

7.1.1 Non-binary categorical variables

If our categorical variable is not binary, we can still use this same test to calculate the proportion of a single category against the rest.

For example, to calculate the proportion of women in our pm_survey dataset:

Code
file2 <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file2)

kable(head(pm_survey))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
timestamp how_old_are_you industry job_title additional_context_on_job_title annual_salary other_monetary_comp currency currency_other additional_context_on_income country state city overall_years_of_professional_experience years_of_experience_in_field highest_level_of_education_completed gender race
2021-04-27 11:04:04 25-34 Nonprofits Project Manager NA 66625 1500 USD NA NA US Massachusetts Boston 8 - 10 years 8 - 10 years College degree Woman Hispanic, Latino, or Spanish origin, White
2021-04-27 11:09:34 35-44 Property or Construction Project Manager NA 52000 6000 USD NA NA United States South Carolina Charleston 11 - 20 years 5-7 years High School Woman White
2021-04-27 11:11:41 35-44 Engineering or Manufacturing Project Manager NA 65000 1000 USD NA NA United States Pennsylvania East Stroudsburg 8 - 10 years 2 - 4 years Master's degree Man White
2021-04-27 11:12:58 45-54 Property or Construction Project Manager NA 52800 3600 USD NA NA USA Arizona Phoenix 21 - 30 years 11 - 20 years High School Woman White
2021-04-27 11:13:28 25-34 Nonprofits Project Manager Environmental Health 65818 1000 USD NA NA United States Illinois Chicago 5-7 years 5-7 years College degree Woman White
2021-04-27 11:14:04 25-34 Manufacturing Project Manager NA 66250 6000 USD NA NA United States Tennessee Nashville 8 - 10 years 5-7 years Master's degree Man White
Code
as.data.frame(report(pm_survey$gender))
n_Entries | n_Obs | n_Missing | percentage_Missing
--------------------------------------------------
5.00      |   221 |         0 |               0.00
Code
pm_survey <- pm_survey |> 
  mutate(across(.cols = c(highest_level_of_education_completed, gender), as.factor))

numberOfWomen<- NROW(pm_survey[pm_survey$gender == 'Woman',])
sampleSize <- NROW(pm_survey)
prop.test(numberOfWomen, sampleSize)

    1-sample proportions test with continuity correction

data:  numberOfWomen out of sampleSize, null probability 0.5
X-squared = 91.24, df = 1, p-value < 0.00000000000000022
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.7653993 0.8701140
sample estimates:
        p 
0.8235294 

7.1.2 Few observations for each outcome

While there are statistical methods for dealing with samples containing fewer than 5 of each sort of outcome, you should be conservative about making decisions based on them. That said, two methods can be helpful in such situations:

  • The Wilson score interval works well even for observed proportions near zero or one and has other good theoretical properties as well. Unlike other confidence intervals we’ve seen, it isn’t symmetric.

  • The rule of three is a great rough-and-ready way to compute a 95% confidence interval for a proportion when only one outcome has been observed. If you have observed failures only, the interval goes from 0 to 3/n, where n is the sample size; if you have observed successes only, it goes from 1 - (3/n) to 1.
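A quick sketch of the rule of three, using a hypothetical sample of 50 trials with no successes (the n of 50 is invented for illustration). For comparison, prop.test's interval, which is based on the Wilson score method, is shown alongside:

```r
# Rule of three: approximate 95% CI for a proportion when we observed
# 0 successes in n trials
n <- 50
c(lower = 0, upper = 3 / n)   # 0 to 0.06

# prop.test's (Wilson-based) interval for the same data
prop.test(0, n)$conf.int
```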

7.2 Significance testing for proportions

We want to know whether the proportion of people taking medication in our sample matches the proportion of people in the USA taking medication, which we know to be 66% of the population. We plug that value into the test with the argument \(p=\), where \(p\) is the proportion under our null hypothesis.

Code
file <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file)

as.data.frame(report(optical_sample$TakingMedication))
n_Entries | n_Obs | n_Missing | percentage_Missing
--------------------------------------------------
2.00      |   100 |         0 |               0.00
Code
medicated<- sum(optical_sample$TakingMedication == "yes")
sampleSize <- nrow(optical_sample)

testResult <- prop.test(medicated, sampleSize, p = .66)

testResult$p.value
[1] 0.0004955456
Code
testResult$conf.int
[1] 0.3894281 0.5913488
attr(,"conf.level")
[1] 0.95

In our case we can reject the null hypothesis and say it is not reasonable to believe that this sample was drawn from a population in which the proportion is 66%.

Exercises

Exercise 1. A media outlet claims that 60% of project managers in the U.S. have a college degree as their highest level of education. Is that claim plausible given our pm_survey data set?

Code
file <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file)

pmWithCollegeDegree <- sum(pm_survey$highest_level_of_education_completed == 'College degree')
sampleSize <- nrow(pm_survey)

testResult <- prop.test(pmWithCollegeDegree, sampleSize, p= 0.6)

testResult$p.value
[1] 0.09662592
Code
testResult$conf.int
[1] 0.4748865 0.6095674
attr(,"conf.level")
[1] 0.95

In this case we cannot reject the null hypothesis.

Exercise 2. Use that data set to construct a 95% confidence interval for the proportion of project managers that are under the age of 35.

Code
pm_under35 <- nrow (pm_survey[pm_survey$how_old_are_you %in% c("18-24","25-34"),])

testResult <- prop.test(pm_under35, sampleSize)
testResult$conf.int 
[1] 0.4390950 0.5742398
attr(,"conf.level")
[1] 0.95
Code
testResult$estimate
        p 
0.5067873 

The confidence interval shows that between 44% and 57% of respondents are under 35

7.3 Goodness of Fit

Consider genetic data where you have two groups of genotypes (A or B) for cases and controls of a given disease. The statistical question is whether genotype and disease are associated. As in the examples we have studied previously, we have two populations (A and B) with numeric data for each, where disease status is coded as 0 or 1. So why can’t we perform a t-test? Note that each data point is either 0 (control) or 1 (case). This data is clearly not normally distributed, so the t-distribution approximation is certainly out of the question. We could rely on the CLT if the sample size is large enough; otherwise, we can use association tests.

Imagine we have 250 individuals, where some of them have a given disease and the rest do not. We observe that 20% of the individuals that are homozygous for the minor allele (group B) have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?

Code
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("A",200),rep("B",50)),
                levels=c("A","B"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up

tab<- table(genotype,disease)
tab
        disease
genotype control cases
       A     180    20
       B      40    10

The typical statistic we use to summarize these results is the odds ratio (OR). We compute the odds of having the disease if you are a “B”: 10/40, the odds of having the disease if you are an “A”: 20/180, and take the ratio: \((10/40) / (20/180)\)

Code
(tab[2,2]/tab[2,1]) / (tab[1,2]/tab[1,1])
[1] 2.25

To compute a \(p\)-value, we don’t use the OR directly. We instead assume that there is no association between genotype and disease, and then compute what we would expect to see in each cell of the table under the null hypothesis. Ignoring the groups, the probability of having the disease according to our data is:

Code
p=mean(disease=="cases")
p
[1] 0.12

According to this probability, under our null hypothesis we expect to see a table like this:

Code
expected <- rbind(c(1-p,p)*sum(genotype=="A"),
                  c(1-p,p)*sum(genotype=="B"))
dimnames(expected)<-dimnames(tab)
expected
        disease
genotype control cases
       A     176    24
       B      44     6

The Chi-square test uses an asymptotic result (similar to the CLT) related to the sums of independent binary outcomes. Using this approximation, we can compute the probability of seeing a deviation from the expected table as big as the one we saw. The \(p\)-value for this table is:

Code
chisq.test(tab)$p.value
[1] 0.08857435

so we would expect to find this difference in disease ratio by chance approximately 8.8% of the time, which is not enough to reject our null hypothesis.

Large Samples, Small \(p\)-values

Reporting only \(p\)-values is not an appropriate way to report the results of your experiment. Studies with large sample sizes will have impressively small \(p\)-values. Yet when one looks closely at the results, we realize odds ratios are quite modest: barely bigger than 1.

To demonstrate, we recalculate the \(p\)-value keeping all the proportions identical but increasing the sample size by a factor of 10, which reduces the \(p\)-value substantially:

Code
tab<-tab*10
chisq.test(tab)$p.value
[1] 0.000000001219624
  1. Fitting Fallacies: Goodness of fit doesn’t guarantee predictive power. A model can fit past data perfectly yet fail miserably on new, unseen data, highlighting the dangers of overfitting.

  2. The Sample Size Paradox: Doubling your sample size doesn’t necessarily halve the error. In fact, to do so, you’d need to quadruple the sample, given the square root relationship between sample size and margin of error.

  3. Two-Sample Twists: When comparing two samples, it’s possible for each sample’s individual data to appear random, yet their combined data can reveal a distinct pattern or significant difference.
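Point 2 above follows from the margin of error scaling like \(1/\sqrt{n}\); a one-line check (holding everything else constant):

```r
# Margin of error is proportional to 1 / sqrt(n):
# quadrupling n halves it; doubling n only shrinks it by about 29%.
me <- function(n) 1 / sqrt(n)
me(400) / me(100)   # 0.5
me(200) / me(100)   # ~0.707
```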

A goodness-of-fit test, also called the chi-squared or Pearson goodness-of-fit test, considers whether a categorical variable follows a hypothesized distribution.

Warning! A goodness of fit test doesn’t give any information about which specific categories might be out of line with the hypothesized distribution. While you might be able to make an educated guess by looking at the data, this test shouldn’t be used to support that kind of intuition.

In 2008 the manufacturer of M&Ms last published their color distribution:

Blue   Orange   Green   Yellow   Red   Brown
 24%      20%     16%      14%   13%     13%

We open several packages of M&Ms and count the colors:

Blue   Orange   Green   Yellow   Red   Brown
  85       79      56       64    58      68

Are these counts consistent with the published percentages? Is there sufficient evidence to claim that the color distribution has changed? This question requires a goodness-of-fit test for the six categories.

Our null hypothesis is that the color distribution is the same. The idea is to compare the observed counts to the numbers we would expect if \(H_0\) is true, so for our sample of 410 M&Ms we would expect:

Blue   Orange   Green   Yellow    Red   Brown
98.4       82    65.6     57.4   53.3    53.3

We take the difference between the observed and expected values, square it, and standardize it by dividing by the expected count:

\[ \chi^2 =\sum_{categories} \frac{(observed-expected)^2}{expected} \]

\[ \frac{(85 - 98.4)^2}{98.4} + \frac{(79 - 82)^2}{82} + \cdots + \frac{(68 - 53.3)^2}{53.3} = 8.57 \]

Large values of the chi-square statistic \(\chi^2\) are evidence against the null hypothesis. To calculate the \(p\)-value we use the chi-square distribution. The \(p\)-value is the right tail of the \(\chi^2\) distribution with degrees of freedom = number of categories -1.
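The arithmetic above is easy to reproduce, and chisq.test gives the same statistic in one call:

```r
# Observed counts and the published color proportions
observed  <- c(85, 79, 56, 64, 58, 68)
published <- c(.24, .20, .16, .14, .13, .13)
expected  <- sum(observed) * published   # 98.4, 82, 65.6, 57.4, 53.3, 53.3

# Chi-square statistic by hand
sum((observed - expected)^2 / expected)  # ~8.57

# Same test in one call; df = 6 - 1 = 5
chisq.test(observed, p = published)
```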

Code
df <- 5
# Create the chi-square distribution curve
curve(dchisq(x, df = df), from = 0, to = 40, 
      main = paste("Chi-Square Distribution (df =", df, ")"),
      ylab = "Density", lwd = 2, col = "steelblue")

# Draw a vertical line at x = 8.57
abline(v = 8.57, col = "red", lwd = 2, lty = 2)

# Highlight the area to the right of x = 8.57
x_vector <- seq(8.57, 40, length.out = 1000)
y_vector <- dchisq(x_vector, df = df)
polygon(c(8.57, x_vector, 40), c(0, y_vector, 0), 
        col = adjustcolor("red", alpha.f = 0.3), border = NA)

In our example this \(p\)-value happens to be 12.7%, which does not allow us to reject the null hypothesis. In R we can calculate the \(p\)-value like this:

Code
# Set the chi-square value and degrees of freedom
x <- 8.57
df <- 5

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
[1] 0.1274943

Exercise:

In the exercise below we want to know whether the age distribution in our project managers survey matches the distribution of ages for the USA population. We take the USA distribution from Wikipedia and make some adjustments so the ranges match those in our survey:

Code
file1 <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file1)

us_ages <- c(.117, .176, .168, .158, .166, .215)

table(pm_survey$how_old_are_you)

     18-24      25-34      35-44      45-54      55-64 65 or over 
         4        108         77         27          4          1 
Code
obs_ages <- as.numeric(table(pm_survey$how_old_are_you))

testResult <- chisq.test(obs_ages, p = us_ages)
testResult %>% report()
Effect sizes were labelled following Funder's (2019) recommendations.

The Chi-squared test for given probabilities / goodness of fit of obs_ages to a
distribution of [: n=25.857, : n=38.896, : n=37.128, : n=34.918, : n=36.686, :
n=47.515] suggests that the effect is statistically significant, and large
(chi2 = 260.52, p < .001; Fei = 0.40, 95% CI [0.35, 1.00])
Code
testResult$p.value
[1] 0.000000000000000000000000000000000000000000000000000003035227

The very small \(p\)-value indicates that it is extremely unlikely that our pm_survey data was extracted at random from a population with the us_ages distribution.

As mentioned, the goodness-of-fit test does not tell us which categories show the discrepancies. If we want to find the differences between our sample’s age ranges and the USA data, we can simply subtract the two sets of proportions and see which categories are underrepresented or overrepresented:

Code
obs_ages / sum(obs_ages)
[1] 0.018099548 0.488687783 0.348416290 0.122171946 0.018099548 0.004524887
Code
obs_ages / sum(obs_ages) - us_ages
[1] -0.09890045  0.31268778  0.18041629 -0.03582805 -0.14790045 -0.21047511

Now we want to see whether, in our optical sample, the customers are evenly distributed between the opticians. If we don’t pass a \(p\) argument to the chi-squared test, it assumes we are testing for an even distribution across all categories:

Code
file2 <- here::here("data", "optical_full.xlsx")
optical_full <- read_excel(file2)

counts <- as.numeric(table(optical_full$`Optician Last Name`))

testResult<- chisq.test(counts)
testResult %>% report()
Effect sizes were labelled following Funder's (2019) recommendations.

The Chi-squared test for given probabilities / goodness of fit of counts to a
uniform distribution suggests that the effect is statistically not significant,
and tiny (chi2 = 46.44, p = 0.164; Fei = 0.01, 95% CI [0.00, 1.00])
Code
testResult$p.value
[1] 0.1636262

In this case we cannot reject the null hypothesis, which here is that the count of patients is evenly distributed among all opticians.

Exercise

Does the distribution of populations of towns in the United States follow Benford’s law? Check using the towns data set.

Code
theme_set(theme_minimal())

file <- here::here("data", "towns.xlsx")
towns <- read_excel(file)

benford <- c(.301, .176, .125, .097, .079, .067, .058, .051, .046)
sum(benford)
[1] 1
Code
#we calculate the frequency of each first digit in the population of towns.
table(towns$first_digit)

   1    2    3    4    5    6    7    8    9 
2966 1746 1243  939  797  656  609  548  496 
Code
digits_freq <- as.numeric(table(towns$first_digit))
digits_freq
[1] 2966 1746 1243  939  797  656  609  548  496
Code
testResult<- chisq.test(digits_freq, p = benford)
testResult %>% report()
Effect sizes were labelled following Funder's (2019) recommendations.

The Chi-squared test for given probabilities / goodness of fit of digits_freq
to a distribution of [: n=3010, : n=1760, : n=1250, : n=970, : n=790, : n=670,
: n=580, : n=510, : n=460] suggests that the effect is statistically not
significant, and tiny (chi2 = 9.24, p = 0.323; Fei = 6.67e-03, 95% CI [0.00,
1.00])
Code
testResult$p.value
[1] 0.3226343
Code
# An illustrative plot
benford_df <- data.frame(distribution = c(rep("Towns", 9), rep("Benford", 9)),
                         first_digit = rep(1:9, 2),
                         frequency = c(digits_freq/sum(digits_freq), benford))

ggplot(benford_df, aes(x = first_digit, 
                       y = frequency,
                       fill = distribution)) + 
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:9) + 
  labs(x = "First digit",
       y = "Relative frequency",
       fill = "Distribution") +
  scale_fill_brewer(palette = "Dark2")

Our \(p\)-value indicates that we cannot reject the null hypothesis, meaning in this case that the distribution of the first digit of town populations across the US is compatible with Benford’s law.

8 Two-Sample testing

Two-sample testing allows for the comparison of two independent groups or populations to assess if there are statistically significant differences between their means, proportions, or other relevant measures.

Confidence intervals for two-sample comparison provide a range of plausible values for the difference in means or proportions, allowing for estimation with a desired level of confidence.

Significance testing for two-sample comparison involves evaluating the evidence against the null hypothesis and determining if the observed differences between groups are statistically significant.

In this simple exercise we are going to generate ratings for two products at random. Since we have not changed the probability of any rating, both products have the same expected rating, so the difference of their mean ratings should approach 0 given enough samples.

Creating many samples at random, calculating their differences, and plotting those differences shows that, although the bell curve spreads more or less evenly around 0 as expected, many individual samples still show a difference.

Code
theme_set(theme_minimal())

set.seed(27)

rating <- sample(1:5, 27, replace = TRUE)
product <- c(rep("A", 15), rep("B", 12))
AB_testing <- data.frame(product,
                         rating)

AB_testing |> 
  group_by(product) |> 
  dplyr::summarize(avg_rating = mean(rating))
# A tibble: 2 × 2
  product avg_rating
  <chr>        <dbl>
1 A             2.2 
2 B             2.83
Code
AB_testing |> 
  group_by(product) |> 
  dplyr::summarize(avg_rating = mean(rating)) |> 
  ggplot(aes(x = product, 
             y = avg_rating,
             fill = product)) + 
  geom_col() +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none") +
  labs(x = "Product",
       y = "Average rating")

Code
# repeat the samples 10000 times:
difference <- integer()
for (rep in 1:10000){
  rating <- sample(1:5, 27, replace = TRUE)
  difference[rep] <- mean(rating[1:15]) - mean(rating[16:27]) 
}
qplot(difference, binwidth = .2, xlab = "Difference in ratings")

In reality we will only have access to one of the samples, and we have to be able to tell whether it is reasonable to draw a conclusion about the population based on that sample, or whether we should attribute these sorts of differences to random chance.

8.1 Significance testing for Two-Samples data

8.1.1 z-test

The significance test for two samples uses the null hypothesis that there is no difference between the two population means. The \(p\)-value is used to reject, or fail to reject, that null hypothesis.

We can use a \(z\)-test for the difference between the two means:

\[ z = \frac{observed\ difference- expected\ difference}{SE\ of\ difference} = \frac{(\bar{x_2} - \bar{x_1})-0}{SE\ of\ difference} \]

Our expected difference is 0 because that’s our null hypothesis (no difference). An important fact is that if \(\bar{x_1}\) and \(\bar{x_2}\) are independent, then:

\[ SE(\bar{x_2} - \bar{x_1}) = \sqrt{ (SE(\bar{x_1}))^2 +(SE(\bar{x_2}))^2} \]

Exercise: two-sample \(z\)-test (proportions)

Last month, the president’s approval rating in a sample of 1,000 likely voters was 55%. This month, a poll of 1,500 likely voters resulted in a rating of 58%. Is this sufficient evidence to conclude that the rating has changed?

\(\hat{p_1} = 55\%\) and \(\hat{p_2} = 58\%\)

\[ z = \frac{(\hat{p_2}-\hat{p_1})-0}{SE_{diff}}=\frac{(\hat{p_2}-\hat{p_1})-0}{\sqrt{ (SE(\hat{p_1}))^2 +(SE(\hat{p_2}))^2}} \]

The formula for the standard error of a proportion is

\[ SE= \sqrt{\frac{p(1-p)}{n}} \]

Calculate the standard errors:

For \(\hat{p_1}\): \[ SE(\hat{p_1}) = \sqrt{\frac{\hat{p_1}(1 - \hat{p_1})}{n_1}} = \sqrt{\frac{0.55 \times 0.45}{1000}}= \sqrt{\frac{0.2475}{1000}} = \sqrt{0.0002475} \approx 0.0157 \]

For \(\hat{p_2}\):

\[ SE(\hat{p_2}) = \sqrt{\frac{\hat{p_2}(1 - \hat{p_2})}{n_2}} = \sqrt{\frac{0.58 \times 0.42}{1500}} = \sqrt{\frac{0.2436}{1500}} = \sqrt{0.0001624} \approx 0.0127 \]

\[ SE_{diff} = \sqrt{(0.0157)^2 + (0.0127)^2} = \sqrt{0.00024649 + 0.00016129} = \sqrt{0.00040778} \approx 0.0202 \]

\[ z = \frac{(0.58 - 0.55) - 0}{0.0202} = \frac{0.03}{0.0202} \approx 1.49 \]

The calculated \(z\)-value is approximately 1.49. To determine if this is statistically significant, you would compare this \(z\)-value to the critical value from the standard normal distribution (typically 1.96 for a 95% confidence level). Since 1.49 is less than 1.96, you would fail to reject the null hypothesis at the 95% confidence level, indicating that there is not sufficient evidence to conclude that the president’s approval rating has changed significantly.

The \(p\)-value can be calculated using standard normal distribution tables or software

Code
pValue <- pnorm(1.49)
pValue
[1] 0.9318879

Since this is a two-tailed test (we are checking if the approval rating has changed, not just increased or decreased), we need to consider both tails of the distribution, so we double our values: The \(p\)-value is calculated as \(2 \times (1 - \text{cumulative probability})\). \(p = 2 \times (1 - 0.9318) = 2 \times 0.0682 = 0.1364\)
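In R the two-tailed \(p\)-value is a single line:

```r
# Double the upper-tail area beyond z = 1.49
2 * pnorm(1.49, lower.tail = FALSE)   # ~0.136
```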

The \(p\)-value for the \(z\)-value of 1.49 is approximately 0.1364. Since this \(p\)-value is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This means there is not sufficient evidence to conclude that the president’s approval rating has changed significantly.

The exercise above shows how to calculate the \(p\)-value for proportions. If we are working with averages instead of proportions, the calculation is the same; the only thing to consider is that the standard error of each individual sample is calculated using the formula for the standard error of the mean, \(SE(\bar{x_1})= \frac{\sigma_1}{\sqrt{n_1}} \approx \frac{s_1}{\sqrt{n_1}}\). If the sample sizes are not large, the \(p\)-value needs to be computed from the t-distribution instead.
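A sketch of the same test for a difference in means, on simulated data (the samples below are invented, not drawn from any dataset in this chapter):

```r
set.seed(1)
# Two hypothetical large, independent samples
x1 <- rnorm(500, mean = 10.0, sd = 2)
x2 <- rnorm(500, mean = 10.2, sd = 2)

# SE of the difference, substituting s for sigma in each sample
se_diff <- sqrt(sd(x1)^2 / length(x1) + sd(x2)^2 / length(x2))

# z-statistic and two-tailed p-value
z <- (mean(x2) - mean(x1)) / se_diff
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
```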

8.1.2 The Welch Two Sample t-test

If we have two sets, one control and one test, their means being \(\mu_t\) and \(\mu_c\) the two-sample t-statistic is defined as:

\[ T = \frac{\hat\mu_t-\hat\mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}} \tag{11}\]

where

\[ s=\sqrt{\frac{(n_t-1)s^2_t+(n_c-1)s^2_c}{n_t+n_c-2}} \]

is an estimator of the pooled standard deviation of the two samples. Here, \(s^2_t\) and \(s^2_c\) are unbiased estimators of the variance of our metric in the treatment and control group, respectively. A large (absolute) value of \(T\) provides evidence against \(H_0\). Strictly speaking, this pooled formula assumes equal variances (R’s t.test with var.equal = TRUE); Welch’s version, which t.test uses by default, replaces the denominator with \(\sqrt{s^2_t/n_t+s^2_c/n_c}\) and adjusts the degrees of freedom.

In the example below we use the attrition dataset to see whether the average age of employees differs between two departments:

Code
file <- here::here("data", "attrition1.xlsx")
attrition1 <- read_excel(file)

kable(head(attrition1))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
41 Yes Travel_Rarely 1102 Sales 1 College Life_Sciences Medium Female 94 High 2 Sales_Executive Very_High Single 5993 19479 8 Yes 11 Excellent Low 0 8 0 Bad 6 4 0 5
49 No Travel_Frequently 279 Research_Development 8 Below_College Life_Sciences High Male 61 Medium 2 Research_Scientist Medium Married 5130 24907 1 No 23 Outstanding Very_High 1 10 3 Better 10 7 1 7
37 Yes Travel_Rarely 1373 Research_Development 2 College Other Very_High Male 92 Medium 1 Laboratory_Technician High Single 2090 2396 6 Yes 15 Excellent Medium 0 7 3 Better 0 0 0 0
33 No Travel_Frequently 1392 Research_Development 3 Master Life_Sciences Very_High Female 56 High 1 Research_Scientist High Married 2909 23159 1 Yes 11 Excellent High 0 8 3 Better 8 7 3 0
27 No Travel_Rarely 591 Research_Development 2 Below_College Medical Low Male 40 High 1 Laboratory_Technician Medium Married 3468 16632 9 No 12 Excellent Very_High 1 6 3 Better 2 2 2 2
32 No Travel_Frequently 1005 Research_Development 2 College Life_Sciences Very_High Male 79 High 1 Laboratory_Technician Very_High Single 3068 11864 0 No 13 Excellent High 0 8 2 Good 7 7 3 6
Code
attrition1 |> 
  group_by(Department) |> 
  dplyr::summarize(avg_age = mean(Age))
# A tibble: 2 × 2
  Department           avg_age
  <chr>                  <dbl>
1 Research_Development    37.0
2 Sales                   36.5
Code
testResult <- t.test(Age ~ Department, 
       data = attrition1)

testResult$p.value
[1] 0.3366682
Code
report(testResult)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of Age by Department (mean
in group Research_Development = 37.04, mean in group Sales = 36.54) suggests
that the effect is positive, statistically not significant, and very small
(difference = 0.50, 95% CI [-0.52, 1.52], t(880.05) = 0.96, p = 0.337; Cohen's
d = 0.06, 95% CI [-0.07, 0.20])

8.2 Confidence intervals for Two-Samples data.

We can also use the formula for the standard error of the difference to build a confidence interval. The confidence interval for \(p_2-p_1\) is \((\hat{p_2}-\hat{p_1}) \pm z \times SE(\hat{p_2}-\hat{p_1})\)

where \(z\) is the \(z\)-score for the confidence level we are interested in.

In our example about the voters' approval of the president, the \(z\)-score for a 95% confidence level is 1.959964:

\[ (0.58-0.55) \pm 1.96 \times 0.0202 \approx [-0.0096,\ 0.0696] \]

If we solve the same exercise with R, we get similar but not identical results: prop.test reports the interval for \(p_1-p_2\) (so the signs are flipped) and applies a continuity correction by default:

Code
successes <- c(0.55 * 1000, 0.58 * 1500)

# Define the sample sizes
sample_sizes <- c(1000, 1500)

# Perform the two-sample z-test for proportions
test_result <- prop.test(successes, sample_sizes)

# Print the result
print(test_result)

    2-sample test for equality of proportions with continuity correction

data:  successes out of sample_sizes
X-squared = 2.0801, df = 1, p-value = 0.1492
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.07051474  0.01051474
sample estimates:
prop 1 prop 2 
  0.55   0.58 
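The remaining discrepancy comes from prop.test's continuity correction. Turning it off reproduces the hand calculation (note that prop.test reports the interval for prop 1 − prop 2, and that the interval conventionally uses the unpooled standard error):

```r
# Hand calculation of the 95% CI for the difference in proportions
p1 <- 0.55; n1 <- 1000
p2 <- 0.58; n2 <- 1500
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci_manual <- (p1 - p2) + c(-1, 1) * qnorm(0.975) * se
ci_manual

# prop.test without the continuity correction gives the same interval
ci_r <- prop.test(c(p1 * n1, p2 * n2), c(n1, n2), correct = FALSE)$conf.int
ci_r
```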

There are many statistical techniques for comparing a single variable across two populations. The most widely applicable is the Welch two-sample confidence interval, which compares means between two independent groups while accounting for the unequal variances often encountered in real-world data, so it is valid in nearly all circumstances. A few things to bear in mind:

  • It is a two-sample procedure, not a multi-variable one. Use it when you’re asking how much two samples differ in a single variable.

  • Like every other statistical tool in this course, it requires that all the observations be independent of one another (so it should not be used for time-series data).

  • Unless the data have extreme outliers or are highly asymmetric, it gives good results when both samples have size 10 or more. With samples of size at least 30, it is fine under all but the most extreme circumstances.

Unlike some other procedures, the Welch two-sample confidence interval does not require that the samples be equal in size or that the populations have equal variances.

We are going to use the AB_testing data we generated in the previous section and calculate the confidence interval for the difference observed between our two samples:

Code
set.seed(27)  
rating <- sample(1:5, 27, replace = TRUE) 
product <- c(rep("A", 15), rep("B", 12)) 
AB_testing <- data.frame(product,rating)  
AB_testing |> 
  group_by(product) |> 
  dplyr::summarize(avg_rating = mean(rating))
# A tibble: 2 × 2
  product avg_rating
  <chr>        <dbl>
1 A             2.2 
2 B             2.83
Code
as.data.frame(report(AB_testing))  
Variable | n_Obs | n_Missing | n_Entries | percentage_Missing | Mean |   SD | Median |  MAD |  Min |  Max | Skewness | Kurtosis
-------------------------------------------------------------------------------------------------------------------------------
product  |    27 |         0 |      2.00 |               0.00 |      |      |        |      |      |      |          |         
rating   |    27 |         0 |           |                    | 2.48 | 1.55 |   2.00 | 1.48 | 1.00 | 5.00 |     0.44 |    -1.38
Code
testResult<- t.test(rating ~ product, data = AB_testing)  
testResult %>% report()  
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of rating by product (mean
in group A = 2.20, mean in group B = 2.83) suggests that the effect is
negative, statistically not significant, and small (difference = -0.63, 95% CI
[-1.88, 0.61], t(23.25) = -1.05, p = 0.305; Cohen's d = -0.41, 95% CI [-1.17,
0.37])
Code
testResult$conf.int  
[1] -1.8804465  0.6137799
attr(,"conf.level")
[1] 0.95

The confidence interval says that, with 95% confidence, the population mean difference is between -1.8804465 and 0.6137799. Since this interval contains 0, the difference between the population means could be 0.

If you know, or can reasonably assume, that the variances of the two populations are equal, you can pass var.equal = TRUE to t.test; in that case R performs a plain two-sample t-test instead of the Welch test:

Code
t.test(rating ~ product,
       data = AB_testing,
       var.equal = TRUE)

    Two Sample t-test

data:  rating by product
t = -1.055, df = 25, p-value = 0.3015
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
 -1.8697428  0.6030761
sample estimates:
mean in group A mean in group B 
       2.200000        2.833333 

In the example below we use our substance_abuse data set to test whether the variable DLA_improvement differs depending on the program the patient was following. Since we do not set var.equal = TRUE here, R runs the Welch test, which does not assume equal variances.

We can also change the confidence level manually if we want:

Code
file <- here::here("data", "substance_abuse.xlsx")
substance_abuse <- read_excel(file)
substance_abuse$DLA_improvement <- substance_abuse$DLA2 - substance_abuse$DLA1
t.test(DLA_improvement ~ Program,
       data = substance_abuse,
       conf.level = .99)

    Welch Two Sample t-test

data:  DLA_improvement by Program
t = 38.23, df = 150.6, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group Intervention and group UsualCare is not equal to 0
99 percent confidence interval:
 0.5539340 0.6350745
sample estimates:
mean in group Intervention    mean in group UsualCare 
               0.591468531               -0.003035714 

Exercises

Exercise 1: Generate a 95% confidence interval for the difference in average monthly income between the Research & Development and Sales departments of the company.

Code
testResult <- t.test(MonthlyIncome ~ Department, 
                     data = attrition1)
testResult$conf.int
[1] -1166.0385  -189.8011
attr(,"conf.level")
[1] 0.95
Code
report(testResult)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of MonthlyIncome by
Department (mean in group Research_Development = 6281.25, mean in group Sales =
6959.17) suggests that the effect is negative, statistically significant, and
very small (difference = -677.92, 95% CI [-1166.04, -189.80], t(1030.99) =
-2.73, p = 0.007; Cohen's d = -0.17, 95% CI [-0.29, -0.05])

The confidence interval does not include 0, which indicates a difference between the population means of the two groups.

Exercise 2: Is it reasonable to claim that the monthly rate is the same between these two departments?

Code
testResult <- t.test(MonthlyRate ~ Department, 
                     data = attrition1,
                     mu = 0)
testResult$p.value
[1] 0.6162771
Code
report(testResult)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of MonthlyRate by Department
(mean in group Research_Development = 14284.87, mean in group Sales = 14489.79)
suggests that the effect is negative, statistically not significant, and very
small (difference = -204.93, 95% CI [-1007.25, 597.40], t(858.77) = -0.50, p =
0.616; Cohen's d = -0.03, 95% CI [-0.17, 0.10])

In light of the results, we cannot reject the null hypothesis that the two departments have the same monthly rate.

8.2.1 Pooled estimate:

Going back to our earlier exercise on the voters' approval rate, we concluded that the two surveys did not differ significantly.

Since we could not reject the null hypothesis that the two samples represent the same approval for the candidate, we can combine them to obtain a better estimate of our standard error.

In the first sample we have 0.55 × 1000 = 550 voters who approved; in the second we have 0.58 × 1500 = 870, so in total we have 1420 approvals out of 2500 people surveyed. Our pooled estimate is therefore \(\frac{1420}{2500}=56.8\%\), and we can calculate the standard error using that new value:

\[ SE(\hat{p_2}-\hat{p_1}) = \sqrt{\frac{0.568(1-0.568)}{1000}+\frac{0.568(1-0.568)}{1500}}=0.02022 \]
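We can check the pooled calculation in R with the numbers from the survey example:

```r
# Pooled approval proportion and the resulting standard error
n1 <- 1000; n2 <- 1500
p_pooled <- (0.55 * n1 + 0.58 * n2) / (n1 + n2)
se_pooled <- sqrt(p_pooled * (1 - p_pooled) / n1 +
                  p_pooled * (1 - p_pooled) / n2)
c(p_pooled = p_pooled, se_pooled = se_pooled)  # 0.568 and about 0.0202
```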

8.2.2 Pooled standard deviation:

If we know (or there is reason to assume) that the standard deviation of the two populations is the same \(\sigma_1=\sigma_2\) then we can use the pooled estimate:

\[ s^2_{pooled}=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2} \]
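As a quick sketch (with made-up numbers), the pooled-variance formula reproduces the t statistic that t.test computes with var.equal = TRUE:

```r
# Two small illustrative samples (hypothetical values)
x <- c(5.1, 4.8, 5.6, 5.0, 4.9)
y <- c(4.2, 4.7, 4.4, 4.9)
n1 <- length(x); n2 <- length(y)

# Pooled variance and the resulting standard error of the difference
s2_pooled <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
se <- sqrt(s2_pooled * (1 / n1 + 1 / n2))
t_manual <- (mean(x) - mean(y)) / se

t_r <- unname(t.test(x, y, var.equal = TRUE)$statistic)
c(t_manual, t_r)  # identical
```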

Two-Sample \(z\)-Test vs Two-Sample t-Test vs Welch’s Two-Sample t-Test

Two sample \(z\)-test

When to Use:

  • Known Population Variances: The population variances are known.

  • Large Sample Sizes: Typically used when the sample sizes are large (n > 30).

  • Normal Distribution: Assumes that the data follows a normal distribution.

Two-Sample t-Test

When to Use:

  • Unknown Population Variances: The population variances are unknown and assumed to be equal.

  • Small Sample Sizes: Typically used when the sample sizes are small (n < 30).

  • Normal Distribution: Assumes that the data follows a normal distribution.

Welch’s Two-Sample T-Test

When to Use:

  • Unknown and Unequal Population Variances: The population variances are unknown and not assumed to be equal.

  • Small or Unequal Sample Sizes: Can be used for small or unequal sample sizes.

  • Normal Distribution: Assumes that the data follows a normal distribution.

Summary:

  • Population Variances:

    • Z-Test: Assumes known population variances.

    • Welch’s T-Test: Does not assume equal variances and uses sample variances.

  • Sample Size:

    • Z-Test: Typically used for large sample sizes.

    • Welch’s T-Test: Can be used for small or unequal sample sizes.
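R has no built-in two-sample \(z\)-test for means, but when the population variances really are known it is easy to write one. A minimal sketch, with made-up data and variances for illustration:

```r
# Two-sample z-test assuming known population variances (hypothetical values)
two_sample_z <- function(x, y, sigma1, sigma2) {
  se <- sqrt(sigma1^2 / length(x) + sigma2^2 / length(y))
  z  <- (mean(x) - mean(y)) / se
  p  <- 2 * (1 - pnorm(abs(z)))   # two-sided p-value
  c(z = z, p.value = p)
}

set.seed(1)
x <- rnorm(40, mean = 10, sd = 2)   # sigma1 = 2 assumed known
y <- rnorm(50, mean = 11, sd = 3)   # sigma2 = 3 assumed known
res <- two_sample_z(x, y, 2, 3)
res
```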

8.3 Paired-t test

What do we do when we have two samples that are not independent of each other? In this case we cannot use the classical two-sample \(z\)-test or two-sample t-test.

We want to answer the question: Do husbands tend to be older than their wives?

Husband’s age  Wife’s age  Age difference
           43          41               2
           71          70               1
           32          31               1
           68          66               2
           27          26               1

In a scenario like this, even if the samples were independent, a two-sample test would not detect a significant difference: the paired differences are consistently small, while the spread within each sample (the standard deviations) is large, and what the two-sample test does is compare the difference in means to the fluctuation within each population.

In this case we will use the paired t-test. Our \(H_0\) is that the population difference is 0. The formula for this test is:

\[ t=\frac{\bar{d}-0}{SE_{(\bar d)}} \]

where \(\bar{d}\) is the average of the differences,

0 is the expected difference, which under our null hypothesis is 0, and \(SE_{(\bar d)}\) is the standard error of the mean difference, estimated from the sample standard deviation of the differences \(s_d\):

\[ SE_{(\bar d)}= \frac{s_d}{\sqrt{n}} \]

In the example with our data:

\[ t=\frac{1.4}{\frac{0.55}{\sqrt{5}}}=5.69 \]

and now we have to use a table of the Student t-distribution with 4 degrees of freedom to find the area under the t-distribution curve beyond our t value (both tails, since the test is two-sided).

We can use R to calculate it like this:

Code
# Given values
t_value <- 5.69
degreesfreedom <- 4

# Calculate the p-value for a two-tailed test
p_value <- 2 * pt(-abs(t_value), degreesfreedom)
p_value
[1] 0.004711695

In this case our result means that we can reject the null hypothesis.
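R does the whole paired calculation in one call; using the five hypothetical couples from the table above:

```r
# Ages from the husband/wife table above
husband <- c(43, 71, 32, 68, 27)
wife    <- c(41, 70, 31, 66, 26)

# paired = TRUE tests whether the mean of the pairwise differences is 0
res <- t.test(husband, wife, paired = TRUE)
res
```

The t statistic (about 5.72) and p-value (about 0.005) match the hand calculation up to the rounding of the standard deviation.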

8.4 The sign test

Imagine that we don’t know the age differences; we only know whether each husband is older or not. We can follow the same approach as with a binomial distribution, like the coin toss. In this specific case our null hypothesis \(H_0\) is that half of the husbands are older than their wives (no difference). We assign labels to the results: 1 if the husband is older than the wife and 0 otherwise.

\[ z=\frac{sum\ of\ 1s -\frac{n}{2}}{SE\ of\ sum} =\frac{sum\ of\ 1s -\frac{n}{2}}{\sqrt{n}\times\sigma_{H_0}} \]

In our case we have 5 husbands older than their wives, so five 1s, and under the null hypothesis the standard deviation of each 0/1 draw is \(\sigma_{H_0}=\sqrt{0.5\times0.5}=\frac{1}{2}\), because we expect half the husbands to be older than their wives.

\[ z= \frac{5-\frac{5}{2}}{\sqrt{5}\frac{1}{2}}= 2.24 \]

Now we can find the \(p\)-value using a table or software:

Code
z_score<- 2.24
# Calculate the p-value for a two-tailed test
p_value <- 2 * (1 - pnorm(z_score))
p_value
[1] 0.02509092

We can see that the result of this test is less significant than the test we did before. This is because the sign test uses less information: only the direction of each difference, not its size.

If we want to solve the problem with software, we can script the same calculation:

Code
# Number of successes (husbands older than wives) 
successes <- 5  
# Total number of pairs 
n <- 5 
# Z-test (Sign test approximation) 
z_score <- (successes - 0.5 * n) / sqrt(0.25 * n) 
z_p_value <- 2 * (1 - pnorm(z_score)) 
z_p_value
[1] 0.02534732

But if we use the exact binomial test that actually applies in this scenario, we get quite a different result; with only 5 pairs, the normal approximation behind the \(z\) version of the sign test is poor:

Code
# Number of successes (husbands older than wives) 
successes <- 5  
# Total number of pairs 
n <- 5  
# Perform the binomial test 
test_result <- binom.test(successes, n, p = 0.5, alternative = "two.sided")  

print(test_result)

    Exact binomial test

data:  successes and n
number of successes = 5, number of trials = 5, p-value = 0.0625
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4781762 1.0000000
sample estimates:
probability of success 
                     1 

::: {.callout-orange appearance="simple" icon="false"}

Both the two-sample paired t-test and the sign test are used to compare paired data, but they are applied in different situations based on the assumptions and characteristics of the data.

Two-Sample Paired T-Test

When to Use:

  • Normal Distribution: The differences between the paired observations should be approximately normally distributed.

  • Interval or Ratio Data: The data should be measured on an interval or ratio scale.

  • Parametric Test: This test is parametric, meaning it relies on assumptions about the distribution of the data.

Example:

  • Comparing the blood pressure of patients before and after a treatment.

  • Measuring the weight of individuals before and after a diet program.

Sign test

  1. Paired Observations: When you have paired data (e.g., before and after measurements) and you want to test if there is a consistent difference between the pairs. For example, testing if a treatment has an effect by comparing measurements before and after the treatment.

  2. Median Differences: When you want to test if the median of differences between pairs is zero. This is useful when the data does not meet the assumptions required for parametric tests like the paired t-test. It does not rely on assumptions about the distribution of the data.

  3. Non-Normal Data: When the data does not follow a normal distribution, making parametric tests inappropriate. The sign test does not assume any specific distribution for the data.

  4. Ordinal Data: When the data is ordinal (ranked) rather than interval or ratio. The sign test can handle data that can only be compared as greater than, less than, or equal to.

Example Scenarios

  • Medical Studies: Comparing the effectiveness of a treatment by measuring patient conditions before and after the treatment.

  • Quality Control: Testing if a new manufacturing process consistently produces better results than the old process.

  • Behavioral Studies: Comparing responses before and after an intervention. Comparing the number of days patients feel better before and after a new medication.

Summary:

  • Use a paired t-test when the differences between paired observations are normally distributed and you have interval or ratio data.

  • Use a sign test when the data does not meet the normality assumption, is ordinal, or you prefer a non-parametric approach.

:::
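A quick sketch of both choices applied to the same paired data (the before/after scores below are made up for illustration):

```r
# Hypothetical before/after measurements for 10 subjects
before <- c(120, 118, 135, 140, 125, 130, 128, 142, 138, 132)
after  <- c(115, 117, 130, 134, 124, 126, 125, 138, 135, 129)
d <- before - after

# Paired t-test: assumes the differences are roughly normal
p_t <- t.test(before, after, paired = TRUE)$p.value

# Sign test: exact binomial test on the number of positive differences
p_sign <- binom.test(sum(d > 0), sum(d != 0), p = 0.5)$p.value

c(p_t = p_t, p_sign = p_sign)
```

Both tests reject the null here, but the sign test only uses the direction of each difference, so it generally has less power when the t-test's assumptions hold.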

8.5 Wilcoxon Rank Sum Test

We learned that the sample mean and SD are susceptible to outliers, and the t-test, being based on these measures, is susceptible as well. The Wilcoxon rank-sum test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we generate data for which the null hypothesis is true, but then suppose that two observations in the first sample were entered incorrectly, creating outliers. The t-test then yields a small \(p\)-value, while the Wilcoxon test does not:

Code
set.seed(779) ##779 picked for illustration purposes
N=25
x<- rnorm(N,0,1)
y<- rnorm(N,0,1)

Create outliers:

Code
x[1] <- 5
x[2] <- 7
cat("t-test pval:",t.test(x,y)$p.value)
t-test pval: 0.04439948
Code
cat("Wilcox test pval:",wilcox.test(x,y)$p.value)
Wilcox test pval: 0.1310212

The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.

Code
library(rafalib)
mypar(1,2)

stripchart(list(x,y),vertical=TRUE,ylim=c(-7,7),ylab="Observations",pch=21,bg=1)
abline(h=0)

xrank<-rank(c(x,y))[seq(along=x)]
yrank<-rank(c(x,y))[-seq(along=x)]

stripchart(list(xrank,yrank),vertical=TRUE,ylab="Ranks",pch=21,bg=1,cex=1.25)

ws <- sapply(x,function(z) rank(c(z,y))[1]-1)
text( rep(1.05,length(ws)), xrank, ws, cex=0.8)

Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values
Code
W <-sum(ws) 

W is the sum of the w values: for each observation in the first group, w counts how many observations in the second group fall below it. We can compute an exact \(p\)-value for \(W\) based on combinatorics. We can also use the CLT, since statistical theory tells us that \(W\) is approximated by the normal distribution. We can construct a \(z\)-score as follows:

Code
n1<-length(x);n2<-length(y)
Z <- (mean(ws)-n2/2)/ sqrt(n2*(n1+n2+1)/12/n1)
print(Z)
[1] 1.523124

Here the Z is not large enough to give us a \(p\)-value less than 0.05. These are part of the calculations performed by the R function wilcox.test.

We are not going to get into the mathematical detail of these calculations, but the formula for the \(z\)-score here is \[ z=\frac{\bar{w} - \frac{n_2}{2}}{\sqrt{\frac{n_2(n_1+n_2+1)}{12 n_1}}} \] where \(\bar{w}\) is the average of the w values (mean(ws) in the code above), and to perform this test in R we use wilcox.test(x, y).
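We can verify that the sum of the w values equals the W statistic that wilcox.test reports, regenerating the same samples as above:

```r
# Recreate the two samples with the two data-entry outliers
set.seed(779)
N <- 25
x <- rnorm(N, 0, 1)
y <- rnorm(N, 0, 1)
x[1] <- 5
x[2] <- 7

# For each x, w counts how many y values fall below it
ws <- sapply(x, function(z) rank(c(z, y))[1] - 1)
W_manual <- sum(ws)

# wilcox.test reports the same statistic
W_r <- unname(wilcox.test(x, y)$statistic)
c(W_manual, W_r)  # identical
```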

8.6 Two-samples of binary data

Two observed proportions for a single binary variable can be compared directly using the two-proportion \(z\)-confidence interval and two-proportion \(z\)-test. These apply when there are at least 5 observations of each type in each of the two groups. The formulas are relatively simple. For example a 95% confidence interval for the difference between proportions is

\[ (\hat p_2-\hat p_1) \pm 1.96 \sqrt{\frac{\hat p_2(1-\hat p_2)}{n_2}+\frac{\hat p_1(1-\hat p_1)}{n_1}} \]

where \(p_1\) and \(p_2\) are the population proportions, \(\hat p_1\) and \(\hat p_2\) are the observed sample proportions, and \(n_1\) and \(n_2\) are the sample sizes. R uses an improved version of this formula when computing the interval.

In R we will use prop.test(successes, sample_sizes), where successes is the vector of success counts and sample_sizes the vector of sample sizes.

Examples:

In the attrition dataset: are the attrition proportions different for the two departments?

Code
kable(head(attrition1))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
41 Yes Travel_Rarely 1102 Sales 1 College Life_Sciences Medium Female 94 High 2 Sales_Executive Very_High Single 5993 19479 8 Yes 11 Excellent Low 0 8 0 Bad 6 4 0 5
49 No Travel_Frequently 279 Research_Development 8 Below_College Life_Sciences High Male 61 Medium 2 Research_Scientist Medium Married 5130 24907 1 No 23 Outstanding Very_High 1 10 3 Better 10 7 1 7
37 Yes Travel_Rarely 1373 Research_Development 2 College Other Very_High Male 92 Medium 1 Laboratory_Technician High Single 2090 2396 6 Yes 15 Excellent Medium 0 7 3 Better 0 0 0 0
33 No Travel_Frequently 1392 Research_Development 3 Master Life_Sciences Very_High Female 56 High 1 Research_Scientist High Married 2909 23159 1 Yes 11 Excellent High 0 8 3 Better 8 7 3 0
27 No Travel_Rarely 591 Research_Development 2 Below_College Medical Low Male 40 High 1 Laboratory_Technician Medium Married 3468 16632 9 No 12 Excellent Very_High 1 6 3 Better 2 2 2 2
32 No Travel_Frequently 1005 Research_Development 2 College Life_Sciences Very_High Male 79 High 1 Laboratory_Technician Very_High Single 3068 11864 0 No 13 Excellent High 0 8 2 Good 7 7 3 6
Code
table(attrition1$Department)

Research_Development                Sales 
                 961                  446 
Code
t<- table(attrition1$Department,
      attrition1$Attrition)
t
                      
                        No Yes
  Research_Development 828 133
  Sales                354  92
Code
yes_counts<- as.numeric(t[,2])
sampleSize <-as.numeric(table(attrition1$Department))

testResult <- prop.test(yes_counts, sampleSize)
testResult

    2-sample test for equality of proportions with continuity correction

data:  yes_counts out of sampleSize
X-squared = 9.9491, df = 1, p-value = 0.001609
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.11295996 -0.02280109
sample estimates:
   prop 1    prop 2 
0.1383975 0.2062780 

Our \(p\)-value, 0.0016093, is very low, which means we can reject the default null hypothesis (that the difference between the two proportions is 0). The output also reports the sample proportions for the two departments, 0.1383975 and 0.206278, and the confidence interval for the difference between them (prop 1 − prop 2). Being negative, it tells us that the second department has the higher attrition rate. The order of the groups is the order in which they were entered in the vectors: since yes_counts has Research_Development first and Sales second, prop 1 refers to Research_Development and prop 2 to Sales.

8.7 Two-samples of categorical variables

Two samples of a single categorical variable can be compared with the \(\chi^2\) test for homogeneity (chi-squared). Under the hood, this is just a goodness-of-fit test of the hypothesis that the observed proportions in one of the samples are the same as those in the pooled sample.

8.7.1 Testing homogeneity

The \(\chi^2\) test of homogeneity tests the null hypothesis that the distribution of a categorical variable is the same across several populations. It assumes that the samples are drawn independently, within and across populations.

See how we can apply this logic to our Titanic survival data:

           First  Second  Third  Crew
Survived     202     118    178   215
Died         123     167    528   698

Note that in this case we are not sampling from a population: the data are not a random sample of the people on board; they represent the whole population. So here the chance process resulting in survival or death is not the sampling, but the result of random events that occurred while looking for a way out of the ship: getting into a lifeboat or into the water, being rescued from the water in time, and so on. The 325 observations of first-class passengers then represent 325 independent draws from a probability histogram that gives a certain chance of survival. The 285 observations of second-class passengers are drawn from the probability histogram for second-class passengers, which may be different. The null hypothesis says that the probability of survival is the same for all four probability histograms. Under this hypothesis we can estimate the probability of survival by pooling all the data, \(\frac{713}{2229}\approx 32\%\), and with this number we can calculate the expected number of surviving passengers for each class:

Surviving   First  Second  Third   Crew
Observed      202     118    178    215
Expected    104.0    91.2  225.8  292.1

Died        First  Second  Third   Crew
Observed      123     167    528    698
Expected    221.0   193.8  480.1  620.8

Now we can compute our chi-squared statistic as we learned, using all the differences between expected and observed values:

\[ \chi^2 =\sum_{categories} \frac{(observed-expected)^2}{expected} \]

\[ =\frac{(202-104)^2}{104}+\frac{(123-221)^2}{221}+\cdots \approx 192.2 \]

In this case the degrees of freedom are calculated as (4-1) × (2-1) = 3, where 4 is the number of columns (the classes) and 2 the number of rows (survived and died).

In our case the \(p\)-value will be extremely small, suggesting we should reject the null hypothesis that all ticket classes had the same probability of survival.

Code
x <- 192.2
df <- 3

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
[1] 0.00000000000000000000000000000000000000002043428
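The statistic itself can be reproduced from the counts; a sketch computing the expected table from the row and column totals:

```r
# Observed counts from the Titanic table
observed <- matrix(c(202, 118, 178, 215,
                     123, 167, 528, 698),
                   nrow = 2, byrow = TRUE)

# Expected counts under homogeneity: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi2 <- sum((observed - expected)^2 / expected)
chi2  # about 192.3
```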

The test in R:

Code
# Create the contingency table
titanic_data <- matrix(c(202, 118, 178, 215, 123, 167, 528, 698), 
                       nrow = 2, 
                       byrow = TRUE,
                       dimnames = list(Survival = c("Survived", "Died"),
                                       Class = c("First", "Second", "Third", "Crew")))

# Print the contingency table
print(titanic_data)
          Class
Survival   First Second Third Crew
  Survived   202    118   178  215
  Died       123    167   528  698
Code
# Perform the chi-squared test of homogeneity
chi_squared_test <- chisq.test(titanic_data)

# Print the test results
print(chi_squared_test)

    Pearson's Chi-squared test

data:  titanic_data
X-squared = 192.34, df = 3, p-value < 0.00000000000000022

Exercise:

Use the substance_abuse data set to decide if the distribution of mental health diagnosis is the same for those with and without at least one psychiatric admission.

Code
as.data.frame(report(substance_abuse$MHDx))
n_Entries | n_Obs | n_Missing | percentage_Missing
--------------------------------------------------
4.00      |   479 |         0 |               0.00
Code
as.data.frame(report(substance_abuse$PsychAdmit))
Mean |   SD | Median |  MAD |  Min |  Max | n_Obs | Skewness | Kurtosis | percentage_Missing
--------------------------------------------------------------------------------------------
0.61 | 0.81 |   0.00 | 0.00 | 0.00 | 5.00 |   479 |     1.41 |     2.39 |               0.00
Code
substance_abuse <- substance_abuse %>% 
  mutate(previousAdmit = ifelse(PsychAdmit == 0, FALSE, TRUE))

t <- table(substance_abuse$previousAdmit, substance_abuse$MHDx)
t
       
        Anxiety Depression Psychosis Trauma
  FALSE      84         85        36     59
  TRUE       46         65        65     39
Code
chisq.test(t)

    Pearson's Chi-squared test

data:  t
X-squared = 21.394, df = 3, p-value = 0.00008719

In this case, according to our \(p\)-value, we cannot say that patients with at least one previous psychiatric admission have the same distribution of diagnoses as patients without any admission.

The chi-squared test for homogeneity says nothing specific about any particular diagnosis; it simply says that the distribution of diagnoses across the two groups (admitted vs not admitted) is not the same.

Is the distribution of SUDx the same for both men and women in our Substance Abuse data?

Code
head(substance_abuse)
# A tibble: 6 × 14
  `Admission Date`    PPID  Program   Age Gender RaceEthnicity MHDx  SUDx  MedDx
  <dttm>              <chr> <chr>   <dbl> <chr>  <chr>         <chr> <chr> <chr>
1 2022-01-13 00:00:00 A234… Interv…    34 F      Other         Depr… Alco… 2    
2 2022-02-18 00:00:00 A232… Interv…    26 M      NonHispWhite  Trau… Opio… 0    
3 2022-01-28 00:00:00 A259… Interv…    62 M      NativeAm      Depr… Opio… 0    
4 2022-01-30 00:00:00 A353… Interv…    34 F      NonHispWhite  Depr… Alco… 0    
5 2022-03-28 00:00:00 A302… UsualC…    46 M      NonHispBlack  Trau… Opio… 0    
6 2022-02-17 00:00:00 A315… Interv…    51 M      NonHispWhite  Anxi… Opio… 1    
# ℹ 5 more variables: PsychAdmit <dbl>, DLA1 <dbl>, DLA2 <dbl>,
#   DLA_improvement <dbl>, previousAdmit <lgl>
Code
as.data.frame(report(substance_abuse$SUDx))
n_Entries | n_Obs | n_Missing | percentage_Missing
--------------------------------------------------
4.00      |   479 |         0 |               0.00
Code
t <- table(substance_abuse$Gender, 
           substance_abuse$SUDx)
head(t)
   
    Alcohol None Opioid Stimulant
  F      68   36     58        17
  M      99   66    104        31

Our null hypothesis is that the distribution between gender is the same.

Code
result <- chisq.test(t)
report(result)
Effect sizes were labelled following Funder's (2019) recommendations.

The Pearson's Chi-squared test of independence between and suggests that the
effect is statistically not significant, and tiny (chi2 = 1.24, p = 0.744;
Adjusted Cramer's v = 0.00, 95% CI [0.00, 1.00])

In this case the \(p\)-value of the test is over 0.05, which indicates no significant difference by gender for the SUDx variable: men and women are using the different categories of substances in the same proportions. The df in the output are the degrees of freedom, calculated as (rows − 1) × (columns − 1) = (2 − 1) × (4 − 1) = 3.

In reality this is the same as doing a goodness-of-fit test, as we saw before: it was chisq.test(counts, p = p), where counts are the observed counts of the different categories in our sample and p the population proportions we want to test against. Back to our example, what we are measuring here is the distribution for women (or men) against the distribution in the full sample, which tells us if there is a difference between men and women, so we can also write the test like this, getting similar results:

Code
sudxProp <- table(substance_abuse$SUDx)/sum(table(substance_abuse$SUDx))

women <- t[1,]

result<- chisq.test(women, p= sudxProp)
report(result)
Effect sizes were labelled following Funder's (2019) recommendations.

The Chi-squared test for given probabilities / goodness of fit of women to a
distribution of [Alcohol: n=62.4070981210856, None: n=38.1169102296451, Opioid:
n=60.5386221294363, Stimulant: n=17.937369519833] suggests that the effect is
statistically not significant, and tiny (chi2 = 0.77, p = 0.856; Fei = 0.02,
95% CI [0.00, 1.00])

8.7.2 Independence Testing of Categorical Variables

We want to answer this question: Is gender (M/F) related to voting preference (liberal/conservative)? Now we have two categorical variables: gender and voting preference. The null hypothesis is that the two variables are independent. The alternative hypothesis is that there is some kind of association.

The tool we are going to use is the chi-square test (\(\chi^2\)) for independence. This test can be used to test whether two categorical variables are independent, that is, whether knowledge about one provides information about the other. The math is exactly the same as for the chi-square test for homogeneity, and hence so is the syntax in most statistical packages, including R, but there are a few things to bear in mind:

  • Technically, the null hypothesis of a \(\chi^2\) test for independence is that, in the larger population, the probability of an observation being in any specific pair of categories is equal to the product of the probabilities of being in each of the individual categories.

  • This is another omnibus test: it says nothing about particular categories.

  • Larger cell counts are better, as usual. Do not use this test if any of the expected cell counts is less than 5.
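To make the mechanics concrete: the expected counts that the test compares against come straight from the table margins. A minimal sketch with a made-up 2x2 gender-by-preference table (the numbers are purely illustrative, not from any dataset in this course):

```r
# Hypothetical 2x2 contingency table (illustrative numbers only)
obs <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("M", "F"), c("liberal", "conservative")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
x2 <- sum((obs - expected)^2 / expected)
x2
```

Calling `chisq.test(obs, correct = FALSE)` reproduces this statistic; note that by default R applies the Yates continuity correction on 2x2 tables, which gives a slightly smaller value.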

We are going to use our product_ratings dataset to examine the relationship between age group and the product the user purchased.

First, we can simply look at the contingency table for those two variables:

Code
file <- here::here("data", "product_ratings.xlsx")
product_ratings <- read_excel(file)
t <- table(product_ratings$age,
           product_ratings$product)
t
       
          A   B   C
  <20   109 136 346
  20-34 104 155 272
  35-49 145 149 301
  50-64  95  86 227
  65+    75  47 118
Code
chisq.test(t)

    Pearson's Chi-squared test

data:  t
X-squared = 30.81, df = 8, p-value = 0.0001519

The low \(p\)-value from this test indicates that the theory that these two categorical variables are independent is implausible.

Chi-Squared is used for testing the independence of categorical variables. It compares observed frequencies in a contingency table to expected frequencies under independence.

Exercises

Referring to the substance_abuse data set:

  • Is there any evidence of an association between Substance Use Diagnosis and Mental Health Diagnosis among patients receiving an intervention?
Code
treatment <- filter(substance_abuse, 
                    Program == "Intervention")
t <- table(treatment$SUDx, treatment$MHDx)
chisq.test(t)

    Pearson's Chi-squared test

data:  t
X-squared = 7.6914, df = 9, p-value = 0.5655

According to this result we cannot conclude that there is a relationship between these two variables.

  • Is there any evidence of an association between SUDx and DLA_improvement among patients receiving an intervention?
Code
ggplot(treatment, aes(x= SUDx, y = DLA_improvement))+
  geom_boxplot()

Code
testResult <- aov( DLA_improvement ~ SUDx, data = treatment)
summary(testResult)
             Df Sum Sq Mean Sq F value Pr(>F)  
SUDx          3  0.288 0.09606   2.981 0.0336 *
Residuals   139  4.479 0.03222                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference is significant, but not by a large margin: the \(p\)-value shows that in about 3% of samples drawn from a population where the category means are equal we would observe differences at least this large.

If we run a TukeyHSD test we can compare each pair individually, and we find that the only pair showing a statistically significant difference is alcohol versus opioids.

Code
TukeyHSD(testResult)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = DLA_improvement ~ SUDx, data = treatment)

$SUDx
                          diff         lwr          upr     p adj
None-Alcohol      -0.068704082 -0.19256733  0.055159168 0.4752566
Opioid-Alcohol    -0.095893737 -0.18646925 -0.005318221 0.0334714
Stimulant-Alcohol -0.002079082 -0.13648847  0.132330308 0.9999765
Opioid-None       -0.027189655 -0.14823533  0.093856018 0.9367288
Stimulant-None     0.066625000 -0.08994452  0.223194521 0.6861228
Stimulant-Opioid   0.093814655 -0.03800277  0.225632083 0.2543154

It is easy to confuse the test for homogeneity with the test for independence.

8.7.3 Comparing the different uses of the chi-square test:

  Test             Data                                           Research question
  goodness of fit  single categorical variable, one sample        Do the counts of the different categories match our expected results?
  homogeneity      single categorical variable, several samples   Are the groups homogeneous (do they have the same distribution of the categorical variable)?
  independence     two categorical variables, one sample          Are the two categorical variables independent?

Examples:

  • We want to know whether there are more births on some weekdays than others (Monday, Tuesday, …). We record the weekday of 300 births. What test should we use to find out whether there is a difference per day? –> We would use the chi-square test for goodness of fit.

  • A food delivery start-up decides to advertise its service by placing ads on web pages. They wonder whether the percentage of viewers who click on the ad changes depending on how often the viewers were shown it. They randomly select 100 viewers from among those who were shown the ad once, 135 from among those who were shown it twice, and 150 from among those who were shown it three times. –> We would use the chi-square test of homogeneity.

  • An airline wants to find out whether there is a connection between the customer’s status in its frequent flyer program and the class of ticket that the customer buys. It samples 1,000 ticket records at random and for each ticket notes the status level (‘none’, ‘silver’, ‘gold’) and the ticket class (‘economy’, ‘business’, ‘first’). –> chi-square test of independence

  • A county wants to check whether the racial composition of the teachers in the county corresponds to that of the population in the county. It samples 500 teachers at random and wants to compare that sample with the census numbers about the racial groups in that county. –> chi-square test for goodness of fit
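All three uses share the same R function; what changes is the shape of the input. A sketch based on the examples above, with invented counts (only the 300 births and the 100/135/150 sample sizes come from the text; the click counts are made up):

```r
# Goodness of fit: one vector of observed counts vs. expected proportions
births <- c(Mon = 45, Tue = 38, Wed = 42, Thu = 41, Fri = 44, Sat = 45, Sun = 45)
chisq.test(births, p = rep(1/7, 7))

# Homogeneity (and, with identical syntax, independence): a two-way table of
# counts; the difference lies in how the data were sampled, not in the math
clicks <- matrix(c(12,  88,
                   20, 115,
                   30, 120),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(times_shown = c("once", "twice", "thrice"),
                                 response = c("clicked", "no_click")))
chisq.test(clicks)
```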

9 Permutation Tests

Suppose we have a situation in which none of the standard mathematical statistical approximations apply. We have computed a summary statistic, such as the difference in means, but do not have a useful approximation for its distribution, such as the one provided by the CLT. In practice, we do not have access to all values in the population, so we cannot perform a simulation either. Permutation tests can be useful in these scenarios.

We are back to the scenario where we only have 12 measurements for each group.

Code
dat=read.csv("data/femaleMiceWeights.csv")

control <- filter(dat,Diet=="chow") %>% dplyr::select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% dplyr::select(Bodyweight) %>% unlist
obsdiff <- mean(treatment)-mean(control)
obsdiff
[1] 3.020833

In previous sections, we showed parametric approaches that helped determine whether the observed difference was significant. Permutation tests take advantage of the fact that if we randomly shuffle the case and control labels, then the null hypothesis is true. So we shuffle the labels and assume that the resulting distribution approximates the null distribution. Here is how we generate a null distribution by shuffling the data 1,000 times:

Code
N <- 12
avgdiff <- replicate(1000, {
    all <- sample(c(control,treatment))
    newcontrols <- all[1:N]
    newtreatments <- all[(N+1):(2*N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v=obsdiff, col="red", lwd=2)

How many of the null means are bigger than the observed value? That proportion would be the \(p\)-value for the null. We add 1 to both the numerator and the denominator so that the estimated \(p\)-value can never be exactly zero (the observed labelling itself counts as one permutation). The \(p\)-value here is computed directly as the proportion of permuted differences that are larger in absolute value than the one observed.

Code
#the proportion of permutations with larger difference
(sum(abs(avgdiff) > abs(obsdiff)) + 1) / (length(avgdiff) + 1)
[1] 0.05294705

Now let’s repeat this experiment for a smaller dataset. We create a smaller dataset by sampling:

Code
N <- 5
control <- sample(control,N)
treatment <- sample(treatment,N)
obsdiff <- mean(treatment)- mean(control)
obsdiff
[1] 4.51
Code
avgdiff <- replicate(1000, {
    all <- sample(c(control,treatment))
    newcontrols <- all[1:N]
    newtreatments <- all[(N+1):(2*N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v=obsdiff, col="red", lwd=2)

Now the observed difference is not significant using this approach. Keep in mind that there is no theoretical guarantee that the null distribution estimated from permutations approximates the actual null distribution. For example, if there is a real difference between the populations, some of the permutations will be unbalanced and will contain some samples that explain this difference. This implies that the null distribution created with permutations will have larger tails than the actual null distribution, which is why permutations result in conservative \(p\)-values. Also, with very few samples there are too few distinct permutations, so the smallest attainable \(p\)-value may be too large for the test to be useful.

Note also that permutation tests still have assumptions: samples are assumed to be independent and “exchangeable”. If there is hidden structure in your data, then permutation tests can produce estimated null distributions that underestimate the size of the tails, because the permutations may destroy the existing structure in the original data.
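As an aside, when the samples are very small we can enumerate every possible relabelling instead of sampling them, which gives an exact permutation test and also shows the small-sample limitation directly: with three observations per group there are only \(\binom{6}{3} = 20\) relabellings, so the smallest attainable two-sided \(p\)-value is \(2/20 = 0.1\). A sketch with made-up measurements:

```r
# Exact permutation test for two tiny groups (illustrative numbers)
g1 <- c(5.1, 4.8, 6.0)
g2 <- c(6.5, 7.1, 6.9)
all_vals <- c(g1, g2)
obs <- mean(g2) - mean(g1)

# Enumerate every way of choosing which 3 of the 6 values form "group 1"
idx <- combn(length(all_vals), length(g1))
diffs <- apply(idx, 2, function(i) mean(all_vals[-i]) - mean(all_vals[i]))

# Exact two-sided p-value: fraction of relabellings at least as extreme
p_exact <- mean(abs(diffs) >= abs(obs))
p_exact
# 0.1: only the observed split and its mirror image are this extreme
```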

10 Multiple test corrections

The Bonferroni correction is a statistical method used to address the problem of multiple comparisons. When you perform multiple hypothesis tests, the chance of making a Type I error (false positive) increases. The Bonferroni correction helps control this by adjusting the significance level.

What we do is: if there are \(m\) tests, we multiply the \(p\)-values by \(m\). This correction ensures that P(any of the \(m\) tests rejects in error) stays below 5%. The Bonferroni correction is often very restrictive: it guards against having even one false positive among the \(m\) tests. As a consequence, the adjusted \(p\)-values may no longer be significant even when a noticeable effect is present, so it only works well when the number of tests is small.
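In R the correction is a one-liner; `p.adjust` multiplies each \(p\)-value by the number of tests and caps the result at 1:

```r
# Example p-values
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)

# Bonferroni: p * m, capped at 1
p.adjust(p_values, method = "bonferroni")
# [1] 0.060 0.240 0.180 0.012 0.300 1.000
```

Comparing these to the 0.05 threshold, only the fourth test survives the correction.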

Alternatively if the number of tests is large, it is better to use the False Discovery Proportion (FDP):

\[ FDP= \frac{number\ of\ false\ discoveries}{total\ number\ of\ discoveries} \]

where a discovery occurs when a test rejects the null hypothesis. If no hypotheses are rejected, the FDP is defined to be 0.

The False Discovery Rate (FDR) is the expected proportion of false discoveries among the discoveries. One common method to control the FDR is the Benjamini-Hochberg procedure, which adjusts the \(p\)-values so that the FDR is controlled at a desired level \(\alpha\). The procedure is as follows:

  1. Sort the \(p\)-values in ascending order. \(p_{(1)}, p_{(2)}, \ldots, p_{(m)}\)

  2. Find the largest \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\)

  3. Reject the null hypotheses corresponding to \(p_{(1)}, \ldots, p_{(k)}\)

In R:

Code
# Example p-values
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)

# Apply the Benjamini-Hochberg procedure
p.adjusted <- p.adjust(p_values, method = "BH")

print(p.adjusted)
[1] 0.030 0.060 0.060 0.012 0.060 0.200

This will give you the adjusted \(p\)-values, and you can compare them to your desired FDR level to decide which hypotheses to reject.
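To see where these numbers come from, the three steps can be reproduced by hand; this is essentially what `p.adjust(..., method = "BH")` computes:

```r
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)
m <- length(p_values)

o <- order(p_values)                    # step 1: sort ascending
adj <- p_values[o] * m / seq_len(m)     # scale p_(k) by m/k
adj <- rev(cummin(rev(adj)))            # enforce monotonicity from the right
adj <- pmin(adj, 1)

bh <- numeric(m)
bh[o] <- adj                            # restore the original order
bh
# [1] 0.030 0.060 0.060 0.012 0.060 0.200
```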

The validation set approach. With this approach you split your data set into two parts, a model-building set and a validation set, before the analysis. You may use data snooping (multiple testing) in the model-building set to find something interesting, and then test that single hypothesis on the validation set.
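A minimal sketch of such a split, using the built-in `mtcars` data purely as a stand-in (the 50/50 fraction is an arbitrary choice):

```r
set.seed(42)  # make the random split reproducible

n <- nrow(mtcars)
build_idx <- sample(n, size = floor(0.5 * n))

building_set   <- mtcars[build_idx, ]   # explore / data-snoop freely here
validation_set <- mtcars[-build_idx, ]  # test one pre-specified hypothesis here
```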

11 Choosing the right test for your data

Source: Bevans, R. (2023, June 22). Choosing the Right Statistical Test | Types & Examples. Scribbr. Retrieved November 18, 2024.

11.0.1 Statistical assumptions

Statistical tests make some common assumptions about the data they are testing:

  1. Independence of observations (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).

  2. Homogeneity of variance: the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test’s effectiveness.

  3. Normality of data: the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.

If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a nonparametric statistical test, which allows you to make comparisons without any assumptions about the data distribution.

If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).
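Assumptions 2 and 3 can be checked in R before committing to a test; a sketch using the built-in `sleep` data purely as an illustration:

```r
# Normality within each group: Shapiro-Wilk test (H0: data are normal)
by(sleep$extra, sleep$group, shapiro.test)

# Homogeneity of variance between two groups: F test (H0: equal variances)
var.test(extra ~ group, data = sleep)
```

With more than two groups, `bartlett.test()` plays the role of `var.test()`.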

11.0.2 Types of variables

The types of variables you have usually determine what type of statistical test you can use.

Quantitative variables represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:

  • Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).

  • Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one (e.g. 1 tree).

Categorical variables represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:

  • Ordinal: represent data with an order (e.g. rankings).

  • Nominal: represent group names (e.g. brands or species names).

  • Binary: represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment, these are the independent and dependent variables). Consult the tables below to see which test best matches your variables.

11.1 Choosing a parametric test: regression, comparison, or correlation

Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

11.1.1 Regression tests

Regression tests look for cause-and-effect relationships. They can be used to estimate the effect of one or more continuous variables on another variable.

11.1.2 Comparison tests

Comparison tests look for differences among group means. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).
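Both situations are one-liners in R; a sketch using the built-in `PlantGrowth` data, chosen only as an illustration:

```r
# Two groups: t-test (control vs. one treatment)
two_groups <- droplevels(subset(PlantGrowth, group %in% c("ctrl", "trt1")))
t.test(weight ~ group, data = two_groups)

# More than two groups: one-way ANOVA across ctrl, trt1 and trt2
summary(aov(weight ~ group, data = PlantGrowth))
```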

  • Paired t-test. Predictor: categorical, 1 predictor. Outcome: quantitative, groups come from the same population. Example: What is the effect of two different test prep programs on the average exam scores for students from the same class? (Predictor: test program)

  • Independent t-test. Predictor: categorical, 1 predictor. Outcome: quantitative, groups come from different populations. Example: What is the difference in average exam scores for students from two different schools? (Predictor: school A/B)

  • ANOVA. Predictor: categorical, 1 or more predictors. Outcome: quantitative, 1 outcome. Example: What is the difference in average pain levels among post-surgical patients given three different treatments?

  • MANOVA. Predictor: categorical, 1 or more predictors. Outcome: quantitative, 2 or more outcomes. Example: What is the effect of flower species on petal length, petal width and stem length?

  • Pearson’s r. Variables: 2 continuous variables. Example: How are latitude and temperature related?

11.2 Flowchart for parametric tests

11.2.1 Parametric Tests

Parametric tests are statistical tests that make certain assumptions about the parameters of the population distribution from which the sample is drawn. These assumptions typically include:

  • Normality: The data follows a normal distribution.

  • Homogeneity of variance: The variances within each group being compared are similar.

  • Independence: The observations are independent of each other.

Because of these assumptions, parametric tests are generally more powerful and can make stronger inferences about the population. Some common parametric tests include:

  • T-tests: Used to compare the means of two groups.

  • ANOVA (Analysis of Variance): Used to compare the means of three or more groups.

  • Regression Analysis: Used to examine the relationship between one or more predictor variables and an outcome variable.

11.2.2 Non-Parametric Tests

Non-parametric tests, on the other hand, do not make strict assumptions about the population parameters. They are more flexible and can be used when the assumptions of parametric tests are not met, such as when the data does not follow a normal distribution or when sample sizes are small. Non-parametric tests are also useful for ordinal data or data with outliers.

Some common non-parametric tests include:

  • Mann-Whitney U Test: Used to compare the medians of two independent groups.

  • Wilcoxon Signed-Rank Test: Used to compare the medians of two related groups.

  • Kruskal-Wallis Test: Used to compare the medians of three or more independent groups.

  • Spearman’s Rank Correlation: Used to assess the relationship between two ordinal variables.

11.2.3 Key Differences

  • Assumptions: Parametric tests assume specific population parameters, while non-parametric tests do not.

  • Data Type: Parametric tests are typically used for interval or ratio data, whereas non-parametric tests can be used for ordinal data or data that does not meet parametric assumptions.

  • Power: Parametric tests are generally more powerful when their assumptions are met, meaning they are more likely to detect a true effect. Non-parametric tests are more robust and flexible but may be less powerful.

11.2.4 Example Scenario

Imagine you want to compare the test scores of students from two different schools. If the test scores are normally distributed and the variances are similar, you would use a parametric test like the independent t-test. However, if the test scores are not normally distributed or have outliers, you might use a non-parametric test like the Mann-Whitney U Test.
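The scenario translates directly into R. With simulated scores (all numbers, including the two outliers, are invented):

```r
set.seed(1)
school_a <- rnorm(30, mean = 70, sd = 10)              # roughly normal scores
school_b <- c(rnorm(28, mean = 75, sd = 10), 5, 150)   # contains two outliers

# If normality and equal-variance assumptions look plausible: parametric
t.test(school_a, school_b)

# With outliers or clearly non-normal scores: nonparametric
wilcox.test(school_a, school_b)
```

In this made-up example the outliers inflate the t-test’s variance estimate, which is exactly the situation where the rank-based test is more robust.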

11.3 Choosing a non-parametric test


  • Spearman’s r. Predictor: quantitative. Outcome: quantitative. Use in place of: Pearson’s r.

  • Chi square test of independence. Predictor: categorical. Outcome: categorical. Use in place of: Pearson’s r.

  • Sign test. Predictor: categorical. Outcome: quantitative. Use in place of: one-sample t-test.

  • Kruskal–Wallis H. Predictor: categorical, 3 or more groups. Outcome: quantitative. Use in place of: ANOVA.

  • ANOSIM. Predictor: categorical, 3 or more groups. Outcome: quantitative, 2 or more outcome variables. Use in place of: MANOVA.

  • Wilcoxon Rank-Sum test. Predictor: categorical, 2 groups. Outcome: quantitative, groups come from different populations. Use in place of: independent t-test.

  • Wilcoxon Signed-rank test. Predictor: categorical, 2 groups. Outcome: quantitative, groups come from the same population. Use in place of: paired t-test.
Source Code
---
title: "Inference"
format: html
editor: visual
---

```{r}
#| echo: false
library(dplyr)
library(tidyverse)
library(here)
library(readxl)
library(easystats)
library(infer)
library(kableExtra)
library(plotly)
library(ggplot2)
library(patchwork)
library(DiagrammeR)
theme_set(theme_minimal())
options(scipen= 999)
```

# Introduction to Inference

```{r, echo=FALSE}
#| echo: false
file <- here::here("data", "optical_full.xlsx")
optical <- read_excel(file)
```

Drawing conclusions from data is difficult because we almost never have complete information about any variable of interest. For instance, we may have only been able to send a satisfaction survey to a small subset of our customers.

-   The **population** of a study consists of all possible observations of interest.

-   A **sample** in a study consists of all the observations we actually have.

We are almost always interested in a population, but only have information about a sample. Statistical inference is the process of reasoning from the one to the other.

-   A **parameter** is a number that describes a population

-   A **statistic** is a number that describes a sample

The mean of the eye difference variable in our optical dataset is `r mean(optical$eye_difference)`. The **sample mean** is found by adding together all the values of a variable in the sample data, and dividing this total by $n$, the sample size.

If we extract 100 samples with three observations each from our data and visually show their means we can see every time we got a different value, sometimes they will be very close to the target parameter, but sometimes they can be quite far.

```{r, fig.align='center', echo=FALSE}
#| eval: true

kable(head(optical))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
as.data.frame(report(optical$eye_difference))

sample <- slice_sample(optical, n = 3)
mean(sample$eye_difference)

samples <- rep_slice_sample(optical, n = 3, reps = 100)

sample_summary <- samples |> 
  group_by(replicate) |> 
  dplyr::summarize(sample_mean = mean(eye_difference))

ggplot(sample_summary, aes(x = sample_mean)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(optical$eye_difference),
             linetype = "dotted")

```

**Variability** is the natural tendency for a statistic to be different across different samples.

A **Biased statistics**, on the other hand, misses the mark on average, for example if we increase the value of all our sample means by 1:

```{r, fig.align='center', echo=FALSE}
ggplot(sample_summary, aes(x = sample_mean+1)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(optical$eye_difference),
             linetype = "dotted")
```

::: {.callout-orange appearance="simple" icon="false"}
**Reproducibility and Replicability**

Replicating means to get the same conclusion with slightly different samples, procedures, and data analysis methods.

Related to that is the problem of reproducibility. That means to get the same results then you simply use the same data and same methods that were claimed in the analysis.
:::

::: {.callout-orange appearance="simple" icon="false"}
::: centered-text
-   Shortcomings of statistical analysis \*
:::

Statistical inference can handle variability, but not bias. Bias comes from incorrect data collection and only good data collection and management practices can prevent bias.

Statistical inference can lead to wildly wrong conclusions when misused or abused for example:

-   Sample bias: not choosing the sample randomly.

-   Undercoverage: your procedure miss out keep portions of the population.

-   Overgeneralization: you need to limit the conclusions that you draw to the population that you have sampled from.

Significance test are particulary easy to abuse:

-   If you run a test just because you've spotted a pattern in your data, you're almost certain to get a positive result even if the trend is just due to random chance. Patterns exist just by chance.

-   p-hacking: repeated testing of the same hypothesis will eventually yield a positive result just by random chance.

-   Confusing significance for importance. Just because a difference is significant does not mean it is important.

No amount of statistical savvy can fully remediate bad data. All of the techniques we'll discuss in this course are designed to address variability in data, not bias. Before attempting any sort of analysis, you should always take some time to think critically about the overall quality of the data. A few questions to consider:

-   Is the sample representative of the population of interest, or might some subgroups be under or over represented?

-   How was the data collected? Are any of the variables self-reported? Were there points in the process where those collecting the data made subjective judgments?

-   How will conclusions from the data analysis be used? How might they be misinterpreted or abused?
:::

# Expected Value and Standard Error:

If we are interested in the average height of adult males in USA and we extract an adult male at random, we expect his height to be around the population average, give or take about one standard deviation $\sigma$ . **The expected value is the population average** $\mu$ . The expected value of the sample average is still the population average, but remember that the population mean is a random variable because sampling is a random process, so its mean won't be exactly the population mean. How far off can we expect this sample mean to the population mean?

**The standard error (SE)** of a statistic tell you roughly how far off the sample mean will be from its expected value (population mean). It indicates how much the sample mean is expected to vary from the true population mean.

$$ 
\text{SE} (\bar{x_n})= \frac{\sigma}{\sqrt{n}} 
$$ {#eq-standardErrorMean}

where n is the sample size. **We can actually use this formula to decide what is the sample size we need to use for a desired accuracy in our statistic**.

It plays a similar role as the *standard deviation* for one observation from the population mean, remember that the Standard deviation measures the dispersion of individual data points around the mean of a dataset. It tells you how spread out the values in a dataset are. In summary, the standard error of an estimate is the standard deviation of the sampling distribution of an estimate.

::: exercise-box
**Exercise:**

*A town has 10,000 registered voters, of whom 6,000 are voting for the Democratic party. A survey organization is taking a sample of 100 registered voters (assume sampling with replacement). The percentage of Democratic voters in the sample will be around \_\_\_\_\_, give or take \_\_\_\_. (You may use the fact that the standard deviation is about 0.5)*

The percentage of Democratic voters in the sample is equal to the mean of the sample, and so the "around" value is the **expected value of the sample mean**, which in turn is equal to the population mean, 0.60 or 60%. Similarly, the "give or take" value is the *standard error* (SE) of the mean, which is equal to $\frac{\sigma}{\sqrt{n}}$ , where σ is the standard deviation of the population and n is the size of the sample. We're told that σ is about 0.5 and n is 100, so the standard error of the mean is 0.05 or 5%.
:::

### Expected value and standard error for the sum

If we are interested in the sum of n draws $S_n$, the sum and the average are related by $S_n = n\bar{x_n}$ so the standard error of the sum: $$
SE(S_n)=\sqrt{n}\sigma
$$ {#eq-standardErrorSum}

which tells us that the variability of the sum of n draws increases at the rate of $\sqrt{n}$ so **while increasing the sample size reduces the standard error of the average, it actually increases the standard error for the sum.**

::: exercise-box
Exercise1 (Expected Value):

*You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be \$10, \$50, or \$100. You may use the fact that the standard deviation of the three amounts \$10, \$50 and \$100 is \$37.*

*What is the expected value of the sum of the 100 pledges?*

The expected value of the sum of a sample is nμ, where n is the size of the sample and μ is the mean of the population. Here we're told that n is 100 and μ must be the average of \$10, \$50, and \$100, which is \$53.33, so the expected value of the sample sum is \$5333.
:::

::: exercise-box
Exercise2 (Standard error for the sum)

*You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be \$10, \$50, or \$100. You may use the fact that the standard deviation of the three amounts \$10, \$50 and \$100 is \$37.*

*What are the chances that the 100 pledges total more than \$5,700 ?*

The standard error of the sample sum is $SE(S_n)=\sqrt{n}\sigma$ where n is the sample size and σ is the standard deviation of the population.

Since n is 100 and we're told that the standard deviation of the population is \$37, we know the standard error of the sample sum is \$370.

Using the normal approximation to the sampling distribution, this tells us that \$5700 is roughly one standard deviation (5700 - 5333 = 366) above the mean of the sampling distribution. From the **empirical rule**, we know that one half of 68% of the data lies within one standard deviation to the right of the mean, so that 50 + (1/2)68 = 84% of the data lies to the left of \$5700. That is, 16% percent of the data lies to right of \$5700.
:::

### Expected value and standard error for percentages.

For percentages, the expected value is $$E = \mu * 100\%$$ {#eq-expectedValuePerc} and the Standard Error $$SE=\frac{\sigma}{\sqrt{n}} *100\%$$ {#eq-standardErrorPerc}

All the above formulas are for sampling with replacement, but they are still approximately true when sampling without replacement if the sample size is much smaller than the size of the population.

------------------------------------------------------------------------

::: exercise-box
Exercise:

*We toss a coin 100 times. How many tails do you expect to see?*

we assing tails a value of 1, and heads a value of 0.

p(1)=0.5, p(0)=0.5

Number of expected tails = $E(sum) = 100 * \mu$

and $\mu = 0*p(0)+1*p(1)$ , so $\mu =0.5$ and $E= 100*0.5$ =50 tails.

The Standard Error will be $SE(sum) = \sqrt{100}*\sigma$

to calculate sigma, we know that $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}$ where:

-   ($\sigma$) is the standard deviation,

-   ($N$) is the total number of observations,

-   ($X_i$) represents each individual observation,

-   ($\mu$) is the mean of the observations.

In our exercise, without having to toss the coing 100 times we calculate it like this:

$\sigma =\sqrt{(x_1-\mu)^{2} * p(1) +(x_2-\mu)^{2}*p(2)}$

$\sigma =\sqrt{(0-0.5)^{2} * 0.5 +(1-0.5)^{2}*0.5}=0.5$

$\sigma^2 ={(0-0.5)^{2} * 0.5 +(1-0.5)^{2}*0.5}=0.25$

$SE(sum) = \sqrt{100}\sigma^2=10*0.25=5$

so our answer is that we expect to see 50 tails, give or take 5 tails.

and in percentages we would expect to see E= 0.5\*100 =50% tails with an standard error of SE =0.5/10\*100 = 5%

in r code, an example of the same problem with n=200 tosses. See the difference in the percentages.

```{r}
# Define the probabilities 
p <- c(0.5, 0.5)  
# Define the values (0 for heads, 1 for tails) 
x <- c(0, 1)  
# Calculate the mean (μ) 
mu <- sum(x * p)  
# Calculate the variance (σ^2) 
variance <- sum((x - mu)^2 * p)  
# Calculate the standard deviation (σ) 
sigma <- sqrt(variance)  
# Number of coin tosses 
n <- 200  
# Calculate the standard error (SE) 
SE <- sqrt(n) * sigma  
 
cat("Expected percentage of tails:", n * mu, "\n") 
cat("Standard error:", SE, "\n") 
cat("Expected number of tails:", mu*100, " % \n") 
cat("Expected standard error percentage:", sigma /sqrt(n), "% \n")
```
:::

------------------------------------------------------------------------

## The sampling distribution

If we toss a fair coin 100 times, we can get any number of tails from 0 to 100. How likely is each outcome?

The number of tails has the binomial distribution with `n=100` and `p=0.5` where `success = tails`. So if the statistic of interest is $S_n=number\ of \ tails$ , then $S_n$ is a random variable whose probability histogram is given by the binomial distribution. This is called the *sampling distribution of the statistic* $S_n$. The sampling distribution provides more detailed information about the chance properties of $S_n$ than the summary numbers given by the expected value and the standard error alone.

There are three histograms to take into consideration:

### Probability histogram

The probability histogram for producing the data. Since both tails and heads have the same probability, our histogram would be two columns with the same height.

```{r, fig.align='center'}
#| echo: false
heads = rep(0,50)
tails = rep(1,50)
theoreticalP = c(heads,tails)
theoreticalP
# Convert the vector to a data frame
df <- data.frame(theoreticalP)


# Create the histogram
gg <- ggplot(df, aes(x = theoreticalP)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Histogram of Theoretical Probabilities", x = "Value", y = "Frequency")

# Print the plot
print(gg)
```

### Empirical histogram

If we toss the coins, we will have am empirical histogram

```{r, fig.align='center', echo=FALSE}
# Set the number of coin tosses
num_tosses <- 100

# Simulate the coin tosses
coin_tosses <- sample(c(0, 1), size = num_tosses, replace = TRUE, prob = c(0.5, 0.5))

# Create a dataframe
df <- data.frame(toss = 1:num_tosses, result = coin_tosses)

# Create the histogram
gg <- ggplot(df, aes(x = result)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "empirical histogram", x = "Value", y = "Frequency")

# Print the plot
print(gg)
```

### Probability histogram of the sample

And finally we can draw the probability histogram of the statistic $S_{100}$, which shows the sampling distribution:

```{r, fig.align='center' , echo=FALSE}
# Set the parameters
n <- 100
p <- 0.5

# Calculate the probabilities for each number of tails (0 to 100)
tails <- 0:n
probabilities <- dbinom(tails, size = n, prob = p)

# Create a dataframe
df <- data.frame(tails = tails, probability = probabilities)

# Create the histogram
gg <- ggplot(df, aes(x = tails, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability Distribution of Number of Tails in 100 Coin Tosses",
       x = "Number of Tails",
       y = "Probability") +
  theme_minimal()

# Print the plot
print(gg)

```

# Central limit theorem (CLT)

The Central Limit Theorem is one of the most important results in statistics. It says that with a sufficiently large sample size, the sample average approximately follows a normal distribution. This underscores the importance of the normal distribution, and of the many commonly used methods that assume normally distributed data.

Let's stop and think about what it means for the sample average to have a distribution. Imagine going to the store and buying a bag of your favorite brand of chocolate chip cookies. Suppose the bag has 24 cookies in it. Will each cookie have the exact same number of chocolate chips in it? It turns out that if you make a batch of cookies by adding chips to dough and mixing it really well, then putting the same amount of dough onto a baking sheet, the number of chips per cookie closely follows a Poisson distribution. (In the limiting case of chips having zero volume, this is exactly a Poisson process.) Thus we expect there to be a lot of variability in the number of chips per cookie. We can model the number of chips per cookie with a Poisson distribution. We can also compute the average number of chips per cookie in the bag. For the bag we have, that will be a particular number. But there may be more bags of cookies in the store. Will each of those bags have the same average number of chips? If all of the cookies in the store are from the same industrial-sized batch, each cookie will individually have a Poisson number of chips. So the average number of chips in one bag may be different from the average number of chips in another bag. Thus we could hypothetically find out the average number of chips for each bag in the store. And we could think about what the distribution of these averages is, across the bags in the store, or all the bags of cookies in the world. It is this distribution of averages that the central limit theorem says is approximately a normal distribution, with the same mean as the distribution for the individual cookies, but with a standard deviation that is divided by the square root of the number of samples in each average (i.e., the number of cookies per bag).

When sampling with replacement and $n$ is large, then the sampling distribution of the sample average approximately follows the normal curve.
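A quick simulation can make this concrete. The following sketch (the exponential population and the sample size are illustrative choices, not taken from this chapter) draws many samples from a strongly skewed population and shows that the distribution of the sample means has the predicted center and spread:

```{r}
set.seed(42)
n <- 40       # sample size (illustrative)
B <- 10000    # number of simulated samples

# Exp(1) population: mean 1, sd 1, strongly right-skewed
xbars <- replicate(B, mean(rexp(n, rate = 1)))

mean(xbars)   # close to the population mean, 1
sd(xbars)     # close to sigma / sqrt(n) = 1 / sqrt(40)
hist(xbars, breaks = 50, main = "Sampling distribution of the sample mean")
```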

For example, going back to the example of the online gambling where we had a probability of winning a small prize of $p=0.2$, we can see how the probability histogram of the binomial distribution approximates the normal distribution more and more closely as we increase the number of trials $n$.

```{r, fig.align='center', echo=FALSE, fig.width=12}
# Set the parameters
n <- 1
p <- 0.2

# Calculate the probabilities for each number of tails (0 to 100)
success <- 0:n
probabilities <- dbinom(success, size = n, prob = p)

# Create a dataframe
df <- data.frame(success = success, probability = probabilities)

# Create the histogram
gg <- ggplot(df, aes(x = success, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability histogram for the binomial n=1",
       x = "Number of successes with n=2",
       y = "Probability") +
  theme_minimal()

n <- 10
p <- 0.2

# Calculate the probabilities for each number of tails (0 to 100)
success <- 0:n
probabilities <- dbinom(success, size = n, prob = p)

# Create a dataframe
df <- data.frame(success = success, probability = probabilities)

gg2 <- ggplot(df, aes(x = success, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability histogram for the binomial n=10",
       x = "Number of successes with n=10",
       y = "Probability") +
  theme_minimal()

n <- 100
p <- 0.2

# Calculate the probabilities for each number of tails (0 to 100)
success <- 0:n
probabilities <- dbinom(success, size = n, prob = p)

df <- data.frame(success = success, probability = probabilities)

gg3 <- ggplot(df, aes(x = success, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability histogram for the binomial n=100",
       x = "Number of successes with n=100",
       y = "Probability") +
  theme_minimal()

combinedPlots<- gg + gg2+ gg3

combinedPlots
```

That means that we can use the normal approximation to compute probabilities. The key point of the theorem is that the sampling distribution of the statistic is approximately normal no matter what the population histogram is.
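As a sketch of the normal approximation in practice (the numbers here are illustrative), we can compare an exact binomial probability with its normal approximation:

```{r}
n <- 100
p <- 0.5
mu <- n * p                       # expected number of tails: 50
se <- sqrt(n * p * (1 - p))       # standard error of the sum: 5

# P(55 or fewer tails in 100 tosses)
pbinom(55, size = n, prob = p)    # exact binomial probability
pnorm(55.5, mean = mu, sd = se)   # normal approximation (with continuity correction)
```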

The mathematical theory behind that:

The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average $\bar{y}$ of a random sample follows a normal distribution centered at the population average μ and with standard deviation equal to the population standard deviation σ, divided by the square root of the sample size N. We refer to the standard deviation of the distribution of a random variable as the random variable's standard error. We have seen this formula before when studying the Standard Error (@eq-standardErrorMean): $$ 
\text{SE} (\bar{x_n})= \frac{\sigma}{\sqrt{n}} 
$$

This is telling us that when we take more samples and plot their results in a histogram, the spread of the histogram becomes smaller (closer to the population average) as we saw in the histograms above.

**Standardized sample mean**

This implies that if we take many samples of size $N$, then the quantity below is approximately normal with mean 0 and standard deviation 1:

$$
\frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}}
$$ {#eq-standardizedSampleMean}

-   $\bar{Y}$: The sample mean, which is the average of your sample data.

-   $\mu$: The population mean, which is the average of the entire population from which the sample is drawn.

-   $\sigma_Y$: The population standard deviation, which measures the spread of the population data.

-   $N$: The sample size, or the number of observations in your sample.

The standardized sample mean is a way to transform the sample mean into a standard normal distribution (mean 0, standard deviation 1). Standardizing the sample mean by subtracting the population mean and dividing by the standard error $\sigma_Y/\sqrt{N}$ allows us to compare it to the standard normal distribution. This is useful because the properties of the normal distribution are well understood and widely applicable in statistical inference.
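A minimal sketch of this standardization, using a uniform population as an illustrative choice:

```{r}
set.seed(1)
N <- 30
mu <- 0.5               # mean of Uniform(0, 1)
sigma <- sqrt(1 / 12)   # sd of Uniform(0, 1)

# Standardize the mean of many samples of size N
z <- replicate(10000, (mean(runif(N)) - mu) / (sigma / sqrt(N)))

mean(z)   # approximately 0
sd(z)     # approximately 1
```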

**Central limit theorem in practice.** The central limit theorem allows us, when we don't have access to the population data, to use the normal approximation to compute $p$-values.

We are going to use the mice bodyweight data for our example. We will work only with female mice because the bodyweights of female and male mice are different. We calculate the mean and the standard deviation of the control and treatment groups, and those will be the parameters of our population.

The CLT tells us that when the samples are large, the random variable is normally distributed. Thus we can compute $p$-values using the function `pnorm`

```{r miceexampleCLT}
dat <- read.csv("data/mice_pheno.csv") 
table(dat$Sex, dat$Diet)

controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  dplyr::select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  dplyr::select(Bodyweight) %>% unlist

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
sd_hf <- sd(hfPopulation)
sd_hf
sd_control <- sd(controlPopulation)
sd_control
```

Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we have to estimate them from samples.

As we described, the CLT tells us that for large $N$, each of these is approximately normal with mean equal to the population mean and standard error equal to the population standard deviation divided by $\sqrt{N}$. We mentioned that a rule of thumb is that $N$ should be 30 or more. However, that is just a rule of thumb, since the accuracy of the approximation depends on the population distribution. Here we can actually check the approximation, and we do that for various values of $N$.

We are going to create 10,000 samples of $N$ mice for $N = 3, 12, 25$ and $50$ and calculate the difference between the treatment and control bodyweight means. We then calculate the t-statistic for each sample size (the t-statistic is explained later).

Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1. This ratio is what we call the t-statistic.

$$
t = \frac{\bar{x} - \mu}{SE}
$$ {#eq-tstatistic} where $SE = \frac{s}{\sqrt{n}}$

In the specific case of our experiment (comparing the means of two samples) this translates into:

$$ 
t = \frac{\bar{y} - \bar{x}}{SE}=\frac{\bar{y} - \bar{x}}{\sqrt{\frac{s_y^2}{n} + \frac{s_x^2}{n}}} 
$$

Finally we will plot a QQ plot for each sample size against the normal distribution to see their fit:

```{r, fig.align='center', fig.width=9, fig.height=9}
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations

##function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation,n)
  x <- sample(controlPopulation,n)
  (mean(y)-mean(x))/sqrt(var(y)/n+var(x)/n)
}
res <-  sapply(Ns,function(n) {
  replicate(B,computetstat(n))
})
par(mfrow=c(2,2))
for (i in seq(along=Ns)) {
  qqnorm(res[,i],main=Ns[i])
  qqline(res[,i],col=2)
}
```

So we see that for $N=3$, the CLT does not provide a usable approximation. For $N=12$, there is a slight deviation at the higher values, although the approximation appears useful. For $N=25$ and $N=50$, the approximation is good.

This simulation only proves that $N=12$ is large enough in this case, not in general. The further the population distribution is from the normal distribution, the larger the sample size we will require.

For populations that are already normally distributed, even small sample sizes will result in sample means that are approximately normally distributed. For populations that are not normally distributed, larger sample sizes are required for the sample means to approximate a normal distribution.

We will see later that we don't usually calculate our $p$-values like this using the CLT; instead we will use a *t-test* that will give us a slightly different result. This is to be expected because our CLT approximation treats the denominator of the t-statistic as practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.

::: {.callout-orange appearance="simple" icon="false"}
**When does the central limit theorem apply?** For the normal approximation to work, the key requirements are:

-   We sample with replacement, or we simulate independent random variables from the same distribution. (If the sample size is much smaller than the population size, then sampling with replacement and sampling without replacement are approximately the same, so this also applies.)

-   The statistic of interest is a sum, average or percentage.

-   The sample size is large enough. The more skewed the population histogram is, the larger the required sample size n. If there is no strong skewness, then n=15 is sufficient.
:::

::: exercise-box
Exercise:

Suppose we are interested in the proportion of times we see a 6 when rolling `n=100` dice. This is a random variable which we can simulate with `x <- sample(1:6, n, replace=TRUE)`, and the proportion we are interested in can be expressed as an average: `mean(x==6)`. Because the die rolls are independent, the CLT applies.

We want to roll n dice 10,000 times and keep these proportions. This random variable (the proportion of 6s) has mean $p=1/6$ and variance $\frac{p(1-p)}{n}$. So according to the CLT: $$
z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}
$$

$z$ should be normal with mean 0 and SD 1.

Perform the simulation, and report what proportion of times $z$ was larger than 2 in absolute value (CLT says it should be about 0.05).

```{r, fig.align='center', fig.width=10}
n=100
N=10000
p=1/6
SE<- sqrt(p*(1-p)/n)
set.seed(1)

res<- replicate(N,sample(1:6,n,replace=TRUE))

prop = apply(res, 2, function(x) mean(x == 6))

par(mfrow = c(1,2))
hist(prop, freq=TRUE)

z_scores <- (prop-p)/SE
plot(z_scores,dnorm(z_scores))

proportion_greater_than_2 <- mean(abs(z_scores) > 2)
print(proportion_greater_than_2)
```
:::

# Significance test vs Confidence intervals

There are two fundamental sorts of questions we might ask our sample data:

-   How can we estimate a parameter while taking into account the variability of the data in an honest way?

-   How can we know if our data supports a specific conclusion, given its inherent variability?

The first of these questions leads to the idea of a **confidence interval**, where we specify not only an estimate of a parameter but also a reasonable margin of error. The latter question gives rise to **significance testing**.

::: {#confidenceIntvsSignifTest style="border: 2px solid #f0ad4e;  border-radius: 8px;   background-color: #fff3cd;   padding: 10px;"}
-   Use a confidence interval if you wish to estimate a parameter from a sample in a way that describes not only the observed mean but also the uncertainty surrounding it.

-   Use a significance test if there is one particular value of interest, for instance representing whether a piece of machinery is properly calibrated.
:::

Significance tests and confidence intervals are two sides of the same coin.

*Significance tests* consider specific individual values, while *confidence intervals* give more general information about the parameter of interest. The latter are more flexible and are generally easier to interpret. Consider a significance test only when there is a single parameter value of interest which you could identify before looking at the data.

# Confidence intervals

A confidence interval is a way of reporting an estimate of a parameter that includes information about how much variability could reasonably be expected due to random chance. A confidence interval for the mean of a quantitative variable has the form

$$
\mu = \bar{x} \pm ME
$$

where $\mu$ is the population mean, $\bar{x}$ is the observed sample mean and $ME$ is a **margin of error.**

## Manual calculation:

A confidence interval gives a range of plausible values for a population parameter. Usually the confidence interval is centered at an estimate for $\mu$ which is an average. Since the central limit theorem applies for averages, the confidence interval has a simple form:

$$
CI = estimate \pm ME
$$

where **ME is the margin of error** and can be decomposed like this:

$$
CI = \bar{x} \pm z \cdot SE
$$ {#eq-confidenceInterval}

where SE is the *standard error* of the statistic. If the statistic is an average or a percentage, then $SE= \frac{\sigma}{\sqrt{n}}$ as we saw in (@eq-standardErrorMean).

And $z$ is the $z$-score corresponding to the desired confidence level. For example, if we would like to have a 95% confidence level, our $z=1.96$

This comes from calculating the value of $z$ where 95% of our data is in the middle:

```{r, fig.align='center', echo=FALSE}

x <- seq(-4, 4, length=100)

# Calculate the corresponding y values for the standard normal distribution
y <- dnorm(x)

# Plot the bell curve
plot(x, y, type="l", lwd=2, col="blue", xlab="Z", ylab="Density", main="Standard Normal Distribution")

# Highlight the area between -1.96 and 1.96
polygon(c(-1.96, seq(-1.96, 1.96, length=100), 1.96), 
        c(0, dnorm(seq(-1.96, 1.96, length=100)), 0), 
        col=rgb(0.1, 0.1, 0.9, 0.2), border=NA)

# Add vertical lines at z = ±1.96
abline(v = 1.96, col = "red", lwd = 2, lty = 2)
abline(v = -1.96, col = "red", lwd = 2, lty = 2)

# Add annotation at z = 1.96
text(1.96, dnorm(1.96), labels = "z = 1.96", pos = 4, col = "red")

# Add annotation for 95% confidence interval
text(0, 0.1, labels = "95%", pos = 3, col = "blue")
```

To get to that $z$ value we use tables or we can use the quantile function `qnorm`:

```{r}
# Calculate the z-score for a 95% confidence interval
z_score <- qnorm(0.975)
z_score
```

The `0.975` value is used because for a 95% confidence interval, you need to capture the central 95% of the distribution, leaving 2.5% in each tail. Therefore, you look up the 97.5th percentile (0.975) to get the $z$-score.

For 90% confidence interval we are looking for the $z$ value in the normalized distribution where 90% of the data falls in our center range and 10% outside, so 5% in each tail.

```{r}
# Calculate the z-score for a 90% confidence interval
desiredConfidence <- 90
tails = (100-desiredConfidence )/2
percentileOfinterest <- desiredConfidence+tails
z_score <- qnorm(percentileOfinterest/100)
z_score
```

For a 99% confidence level:

```{r}
# Calculate the z-score for a 99% confidence interval
desiredConfidence <- 99
tails = (100-desiredConfidence )/2
percentileOfinterest <- desiredConfidence+tails
z_score <- qnorm(percentileOfinterest/100)
z_score
```
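Putting the pieces together, here is a minimal sketch of @eq-confidenceInterval computed by hand on simulated data (the sample is illustrative):

```{r}
set.seed(7)
x <- rnorm(50, mean = 10, sd = 2)    # illustrative sample

estimate <- mean(x)
SE <- sd(x) / sqrt(length(x))        # sample sd in place of the unknown population sd
z_score <- qnorm(0.975)              # 95% confidence level

c(lower = estimate - z_score * SE,
  upper = estimate + z_score * SE)
```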

::: exercise-box
Confidence Intervals for Coin Flips

**Example 1: Flipping a Coin 100 Times**

In this example of flipping a coin 100 times, we observed 44 heads, resulting in the following 95% confidence interval for $p$:

$$
CI= \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
$$

where:

-   $\hat{p}$ is the observed proportion of heads, here $\hat{p} = \frac{44}{100} = 0.44$

-   $z$ is the critical value from the standard normal distribution for a 95% confidence level (approximately 1.96)

-   $n$ is the number of trials (100 in this case)

```{r}
n1 <- 100
x1 <- 44
p_hat1 <- x1 / n1
z <- 1.96

# Confidence interval calculation
(ci_lower1 <- p_hat1 - z * sqrt(p_hat1 * (1 - p_hat1) / n1))
(ci_upper1 <- p_hat1 + z * sqrt(p_hat1 * (1 - p_hat1) / n1))

```

p: (0.343, 0.537)

From this, we concluded that it is plausible that the coin may be fair because $p = 0.5$ is in the interval.

**Example 2: Flipping a Coin 100,000 Times**

Suppose instead that we flipped the coin 100,000 times, observing 44,000 heads (the same percentage of heads as before). Then using the method just presented, the 95% confidence interval for $p$ is:

```{r}
# Example 2: Flipping a Coin 100,000 Times
n2 <- 100000
x2 <- 44000
p_hat2 <- x2 / n2

# Confidence interval calculation
(ci_lower2 <- p_hat2 - z * sqrt(p_hat2 * (1 - p_hat2) / n2))
(ci_upper2 <- p_hat2 + z * sqrt(p_hat2 * (1 - p_hat2) / n2))

```

p: (0.437, 0.443)

*Is it reasonable to conclude that this is a fair coin with 95% confidence?* Based on this narrower confidence interval, it appears less likely that the coin is fair since $p = 0.5$ is not within the interval.

```{r, echo=FALSE}
# Data for plotting
data <- data.frame(
  Example = c("100 Flips", "100,000 Flips"),
  p_hat = c(p_hat1, p_hat2),
  CI_Lower = c(ci_lower1, ci_lower2),
  CI_Upper = c(ci_upper1, ci_upper2)
)

# Plot the confidence intervals
ggplot(data, aes(x = Example, y = p_hat)) +
  geom_point() +
  geom_errorbar(aes(ymin = CI_Lower, ymax = CI_Upper), width = 0.2) +
  labs(title = "95% Confidence Intervals for Coin Flips",
       x = "Example",
       y = "Proportion of Heads") +
  theme_minimal()

```
:::

**Estimating the SE with bootstrap principle**

We still have a problem when calculating the confidence interval this way: we need to know the standard deviation of the population $\sigma$, and we usually don't know it. However, the [bootstrap principle](#bootstrap) states that we can replace $\sigma$ by its sample version $s$ and still get an approximately correct confidence interval.

::: exercise-box
Example: We poll 1000 likely voters and find that 58% approve of the way the president handles his job.

$SE= \frac{\sigma}{\sqrt{n}} \times 100$ where $\sigma = \sqrt{p(1-p)}$ and $p$ is the proportion of **all voters** who approve. We don't know $p$, but the bootstrap principle tells us that we can replace $\sigma$ by $s$, so we can plug in the value from our survey: $s=\sqrt{0.58(1-0.58)} = 0.49$

So a 95% confidence interval for $p$ is: $$
58\% \pm 1.96 \times \frac{0.49}{\sqrt{1000}} \times 100
$$

which is approximately \[54.9%,61.1%\]
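The interval can be checked numerically; this sketch repeats the same arithmetic in R:

```{r}
p_hat <- 0.58
n <- 1000
SE_pct <- sqrt(p_hat * (1 - p_hat)) / sqrt(n) * 100   # SE in percentage points

c(lower = 58 - 1.96 * SE_pct,
  upper = 58 + 1.96 * SE_pct)
```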
:::

The width of the confidence interval is determined by $z$ and the standard error SE. To reduce the *margin of error* we have two options: we can increase the sample size or decrease the confidence level. The sample size appears under a square root, which means that to cut the width of the confidence interval in half we need four times the sample size, and to reduce it by a factor of 10 we would need 100 times the sample size.
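The square-root scaling can be verified directly (the values of $\sigma$ and $z$ here are illustrative):

```{r}
sigma <- 10
z <- 1.96
ME <- function(n) z * sigma / sqrt(n)

ME(100)      # baseline margin of error
ME(400)      # 4 times the sample size: half the width
ME(10000)    # 100 times the sample size: one tenth of the width
```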

## Getting the confidence intervals from a test result

```{r}
#simulate our survey results
success<- rep(1,580)
failure<- rep(0,420)
mydata <- c(success,failure)
# do a t.test
testResult <- t.test(mydata)
confInt95_low <- testResult$conf.int[1]
confInt95_upp <- testResult$conf.int[2]
cat('confidence interval = [',confInt95_low,confInt95_upp,']')
margin_of_error <- (confInt95_upp - confInt95_low) / 2
cat('margin of error: ',margin_of_error)
```

For example in our optical dataset, if we do a t.test over that variable we can get the confidence intervals for a level of confidence 95%:

```{r, fig.align='center'}
file1 <- here::here("data","optical_sample.xlsx")
optical_sample <- read_excel(file1)

testResult <- t.test(optical_sample$eye_difference)

confInt95_low <- testResult$conf.int[1]
confInt95_upp <- testResult$conf.int[2]

margin_of_error <- (confInt95_upp - confInt95_low) / 2 

ggplot(optical_sample, aes(x = eye_difference)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(optical_sample$eye_difference),
             linetype = "dashed")+
  geom_vline(xintercept = confInt95_low,
           linetype = "dotted", color = "green")+
  geom_vline(xintercept = confInt95_upp,
           linetype = "dotted", color = "green")
```

the margin of error will be:

```{r}
margin_of_error <- (confInt95_upp - confInt95_low) / 2 
margin_of_error
```

Now we can calculate the confidence intervals for other 90% and 99% level of confidence and see their differences in a graph:

```{r, fig.align='center'}
#level of confidence 90%. 
testResult <- t.test(optical_sample$eye_difference,
       conf.level = .90)
confInt90_low <- testResult$conf.int[1]
confInt90_upp <- testResult$conf.int[2]

#level of confidence 99%. 
testResult <- t.test(optical_sample$eye_difference,
                   conf.level = .99)
confInt99_low <- testResult$conf.int[1]
confInt99_upp <- testResult$conf.int[2]

ggplot(optical_sample, aes(x = eye_difference)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(optical_sample$eye_difference),
             linetype = "dashed")+
  geom_vline(xintercept = confInt90_low,
             linetype = "dotted", color = "orange")+
  geom_vline(xintercept = confInt90_upp,
             linetype = "dotted", color = "orange")+
  geom_vline(xintercept = confInt95_low,
           linetype = "dotted", color = "green")+
  geom_vline(xintercept = confInt95_upp,
           linetype = "dotted", color = "green")+
  geom_vline(xintercept = confInt99_low,
             linetype = "dotted", color = "yellow")+
  geom_vline(xintercept = confInt99_upp,
             linetype = "dotted", color = "yellow")
```

Three factors feed into the size of the margin of error:

1.  The **confidence level of the interval**. That is, the proportion of the time our interval would correctly capture the parameter of interest. Higher confidence requires a larger margin of error.

2.  The **spread of the observations** in the data set. More spread in the data implies more sampling variability and therefore a larger margin of error.

3.  The **size of the sample**.

::: exercise-box
Exercises:

Using the pm_sample data set,

1.  Find a 95% confidence interval for the mean salary of the project managers in this population. Interpret the results in plain language. Would a 99% confidence interval be wider or more narrow?

2.  Find a 95% confidence interval for the mean non-salary compensation of project managers in this population. Why is the margin of error different from the one in the first problem?

```{r}
file <- here::here("data", "pm_survey.xlsx") 
pm_survey <- read_excel(file) 

kable(head(pm_survey))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
as.data.frame(report(pm_survey$annual_salary))

t.test(pm_survey$annual_salary) 

```

The confidence interval at the (default) 95% is `r t.test(pm_survey$annual_salary)$conf.int` which means in plain language that if we were to repeat the sampling and the test many times, this confidence interval would capture the true value of the parameter 95% of the time. Although not strictly true, we can say that there's a 95% chance that this confidence interval includes the true population mean.

For the second exercise

```{r}
pm_survey <- pm_survey |>    
  mutate(other_monetary_comp = replace_na(other_monetary_comp, 0))  

as.data.frame(report(pm_survey$other_monetary_comp))

t.test(pm_survey$other_monetary_comp)  
```

our confidence interval is `r t.test(pm_survey$other_monetary_comp)$conf.int` and so, the margins of error in the two examples are:

```{r}
salary_marginError <- (t.test(pm_survey$annual_salary)$conf.int[2] - 
                         t.test(pm_survey$annual_salary)$conf.int[1]) / 2 
salary_marginError

nonSalary_marginError <- (t.test(pm_survey$other_monetary_comp)$conf.int[2] - 
                            t.test(pm_survey$other_monetary_comp)$conf.int[1]) / 2
nonSalary_marginError
```

The difference between the two is due to the variability of the data, which we can measure using the standard deviation:

```{r}
sd(pm_survey$annual_salary) 
sd(pm_survey$other_monetary_comp)
```
:::

### Interpreting confidence intervals: practical example

We are going to calculate the confidence interval for 100 samples of 30 observations drawn from the same population and plot them on a graph with a line marking the true mean. We will see how some of them (for a 95% confidence level it should be around 5% of the time) still miss the population parameter:

```{r, fig.align='center', fig.width=6}

#calculate the sample mean and confidence interval from a sample of 100 
low = numeric()
high = numeric()
for (n in 1:100){
  sample <- slice_sample(optical, n = 30)
  test <- t.test(sample$eye_difference)
  low[n] <- test$conf.int[1]
  high[n] <- test$conf.int[2]
}

ci_reps <- data.frame("replicate" = 1:100,
                      low,
                      high)
ggplot(ci_reps, aes(x = low, 
                    xend = high,
                    y = replicate, 
                    yend = replicate)) + 
  geom_segment() +
  geom_vline(xintercept = mean(optical$eye_difference), 
             linetype = "dashed")
```

When interpreting a confidence interval, remember that not all values in the interval are equally likely. Values near the center are more plausible than those near the ends.

# Significance testing

A t-significance test is used to consider whether an individual parameter value of interest is plausible in light of sample data.

The situation in which the parameter has that value is referred to as the **null hypothesis**, and the situation in which it does not is referred to as the **alternative hypothesis**. For example: could the difference between the left eye value and the right eye value actually be zero? Does a certain large company pay better than the national average, or could salaries there just be different through random chance?

A *test statistic* measures how far away the data in our sample are from what we would expect if the null hypothesis $H_0$ were true.

The most common test statistic is the *z*-statistic. It determines how far an observed value is from the expected value, measured in standard errors. We already saw this formula before (@eq-zscore)

$$
z= \frac{observed - expected}{SE}
$$ {#eq-zscore3}

Observed is a statistic that is appropriate for assessing $H_0$. For example, if we toss a coin ten times and want to know if it is a fair coin, the appropriate statistic would be the number of tails or the percentage of tails.

Expected and SE are the expected value and the SE of this statistic computed under the assumption that $H_0$ is true.

::: exercise-block
Example:

We toss a coin 10 times, and get 7 tails.

For a binomial distribution we have $E = n \cdot p$, where $n$ is the number of repetitions and $p$ is the probability of success (0.5), and the standard error for binomial scenarios is $SE = \sqrt{np(1-p)}$. So the expected number of tails if the coin is fair is:

$$
expected = 10 * \frac{1}{2} = 5
$$

$$
SE = \sqrt{10}\sqrt{\frac{1}{2} * \frac{1}{2}} =1.58
$$

$$
z= \frac{7-5}{1.58}=1.27
$$

By the central limit theorem, the $p$-value can be computed with normal approximation.

```{r}
z <- 1.27

# Calculate the cumulative probability up to z
(p_value <- (1 - pnorm(z))*2)

```

In other words, if the coin is fair, we have a 20.4% probability of observing a result at least as extreme as 7 tails in 10 tosses.
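With only 10 tosses the normal approximation is rough. As a side check (not part of the original example), the exact two-sided binomial probability can be computed directly and compared with the approximation:

```{r}
# Exact two-sided probability of at least 7 tails (or at most 3) in 10 tosses
2 * (1 - pbinom(6, size = 10, prob = 0.5))

# Normal approximation used above
2 * (1 - pnorm(1.27))
```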
:::

If the $z$-value is large, that means that there is a great difference between the observed and the expected, so large values of $z$ are evidence against $H_0$. The strength of the evidence is measured by the $p$-value or observed significance level.

The $p$-value is the probability of getting a value of $z$ as extreme or more extreme than the observed one, assuming the null hypothesis is true. As we can see in the graph below, as $z$ becomes larger, there is less probability of finding data outside of its limits, so the $p$-value will be smaller. We multiply the one-tail probability by 2 because we have to consider both tails of the distribution for this experiment.

The $z$-score tells you how far an observed value is from the expected value in terms of standard errors. The $p$-value tells you the probability of observing a test statistic as extreme as, or more extreme than, the observed statistic, under the null hypothesis.

```{r, fig.align='center', echo=FALSE}
#| echo = FALSE
x <- seq(-4, 4, length=100)

# Calculate the corresponding y values for the standard normal distribution
y <- dnorm(x)

# Plot the bell curve
plot(x, y, type="l", lwd=2, col="blue", xlab="", ylab="Density", main="Standard Normal Distribution", xaxt='n')

# Highlight the area before -1.27
polygon(c(-4, seq(-4, -1.27, length=100), -1.27), 
        c(0, dnorm(seq(-4, -1.27, length=100)), 0), 
        col=rgb(0.1, 0.1, 0.9, 0.2), border=NA)

# Highlight the area after 1.27
polygon(c(1.27, seq(1.27, 4, length=100), 4), 
        c(0, dnorm(seq(1.27, 4, length=100)), 0), 
        col=rgb(0.1, 0.1, 0.9, 0.2), border=NA)

# Add vertical lines at z = ±1.27
abline(v = 1.27, col = "red", lwd = 2, lty = 2)
abline(v = -1.27, col = "red", lwd = 2, lty = 2)

# Add annotation at z = 1.27
text(1.27, dnorm(1.27), labels = "z = 1.27", pos = 4, col = "red")
text(-1.27, dnorm(-1.27), labels = "z = -1.27", pos = 4, col = "red")


```

The **result of a significance test is a** $p$-value, which measures how plausible the value of interest is given the sample data. A $p$-value close to zero indicates the value isn't compatible with the data. As a general rule, if `p<0.05`, the value can be considered questionable.

Ideally, you should identify a value of interest (the technical term is *null hypothesis*) before collecting the data. While this isn't always done in practice, doing so helps prevent the wrong conclusions that come from the shotgun effect:

> If you run a test because you observed an interesting pattern in your sample data, the overall chance of encountering a significant result increases. Data is like firing a shotgun with many pellets: some pellets are bound to hit the target purely by chance. If you formulate your null hypothesis based on a characteristic you observed in your sample, you are likely to find significance even if that characteristic is not present in the overall population and appeared in your sample just by chance.

While a $p$-value might tell you that a parameter is different from a hypothesized value, it won't ever say how different it might be or if that difference is important.

A *t-significance test* is appropriate for quantitative data under the same circumstances as a t-confidence interval: whenever the sample is not very small and the data is not highly non-symmetric. If the data is relatively symmetric and there are no extreme outliers, `n=10` is usually enough. Regardless of the shape of the data, `n=30` is a safe threshold.

The value of $\mu$ (mu) is the value we want to test against (our null hypothesis).

```{r}
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)

testResult <- t.test(optical_sample$eye_difference, mu = 0) 

testResult$p.value
```

The $p$-value expresses the probability of observing values like those in our sample if the eye difference in the population were in fact 0. The smaller the $p$-value, the smaller that probability. We usually take a threshold of 5% ($p$-value \< 0.05) to reject the null hypothesis, as it means it is unlikely that the mean observed in our sample comes from a population whose true mean is 0.

We can change the null hypothesis to whatever value we want to test against. For example, can we say that the average age of the population is NOT 59?

```{r}
testResult <- t.test(optical_sample$Age, mu = 59)
testResult$p.value 
```

Our $p$-value is `{r}testResult$p.value`, which does not allow us to reject the null hypothesis.

::: {#SignificanceTestQuickFacts .callout-orange}
Quick Facts

1.  A $p$-value below 0.05 doesn't mean there's a 5% chance the null hypothesis is true. Instead, it means if the null were true, there's a 5% chance of observing data as extreme as, or more extreme than the one that was observed in the sample.

2.  Statistical power, the ability to detect a true effect, is often overlooked. A study can have a high chance of missing real effects (Type II error) even if its $p$-value threshold is stringent.

3.  In proportional analysis, Simpson's Paradox can occur where a trend seen in several groups reverses when the groups are combined. It underscores the importance of scrutinizing aggregated data.
:::

### One sided vs two sided test.

One has to carefully consider whether the alternative hypothesis should be two-sided (we consider the values to the left of $-z$ and to the right of $z$) or one-sided (we consider only the values on one side). For a two-sided test, the $p$-value is doubled. It is not acceptable to change the alternative afterwards just to get the $p$-value under 5%.

Deciding whether to use a one-sided or two-sided t-test depends on the hypothesis you want to test. Here are some guidelines to help you make that decision:

Use a **one-sided t-test** when you have a specific direction in mind for your hypothesis. This means you are testing whether the mean is either greater than or less than a certain value, but not both.

**Examples**:

-   **Greater than**: You want to test if the mean score of a new teaching method is greater than the mean score of the traditional method.

-   **Less than**: You want to test if the mean time to complete a task using a new software is less than the mean time using the old software.

Use a **two-sided t-test** when you are interested in any difference from the specified value, regardless of direction. This means you are testing whether the mean is different from a certain value, without specifying the direction of the difference.

**Examples**:

-   You want to test if the mean weight of a sample of apples is different from a known standard weight.

-   You want to test if the mean score of a new drug is different from the mean score of a placebo.

How to Decide

1.  **Define Your Research Question**: Clearly state what you are trying to find out.

2.  **Determine the Direction of Interest**: Decide if you are only interested in deviations in one direction (one-sided) or in both directions (two-sided).

3.  **Formulate Your Hypotheses**: Based on your research question and direction of interest, formulate your null and alternative hypotheses.

**Example Scenario**

Suppose you are testing a new drug and want to know if it has a different effect on blood pressure compared to a placebo. If you only care whether the drug lowers blood pressure, you would use a one-sided test. If you care whether the drug either lowers or raises blood pressure, you would use a two-sided test.

```{r}
# Sample data
sample_data <- c(78, 82, 85, 90, 76, 79, 81, 77, 74, 88)

# Perform one-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "greater")

# Print the result
print(t_test_result)

# Perform two-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "two.sided")

# Print the result
print(t_test_result)

```

### Common errors in significance testing {#testErrors}

At the start of a significance test, the hypothesized value might be true or false. At the end, it may be rejected or not, but we still don't know for sure whether $H_0$ is true. If the null hypothesis holds, we say that it is a *true null hypothesis*; otherwise it is a *false null hypothesis*. Altogether, there are four possible combinations of these outcomes:

+-----------------------+-----------------------------------+-------------------------------------+
|                       | $H_0$ is true                     | $H_0$ is false                      |
+======================:+:=================================:+:===================================:+
| $H_0$ is rejected     | **Type I error** -false positive- | True Positive                       |
+-----------------------+-----------------------------------+-------------------------------------+
| $H_0$ is not rejected | True Negative                     | **Type II error** -false negative-  |
+-----------------------+-----------------------------------+-------------------------------------+

We call the *event* of rejecting the null hypothesis, when it is in fact true, a *Type I error*, we call the *probability* of making a Type I error, the *Type I error rate*, and we say that rejecting the null hypothesis when the $p$-value is less than $\alpha$, *controls* the Type I error rate so that it is equal to $\alpha$
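A quick simulation sketch (with simulated normal data) illustrates this control: when the null hypothesis is true, the fraction of tests with $p < \alpha$ comes out close to $\alpha$:

```{r}
set.seed(42)
alpha <- 0.05
# 10,000 one-sample t-tests on data drawn from a population where H0 (mu = 0) is true
pvals <- replicate(10000, t.test(rnorm(20), mu = 0)$p.value)
mean(pvals < alpha)  # close to 0.05: the Type I error rate is controlled at alpha
```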

We may observe a $p$-value higher than alpha even when the null hypothesis is false. This is a Type II error. We ask ourselves this question: How big does $N$ have to be in order to detect that the absolute value of the difference is greater than zero? Type II error control plays a major role in designing data collection procedures before you actually see the data, so that you know the test you will run has enough sensitivity or power. Power is 1 minus Type II error rate, or the probability that you will reject the null hypothesis when the alternative hypothesis is true.

There are several aspects of a hypothesis test that affect its power for a particular effect size. Intuitively, setting a lower $\alpha$ decreases the power of the test for a given effect size because the null hypothesis will be more difficult to reject (for example from 0.05 to 0.01). This means that for an experiment with fixed parameters (i.e., with a predetermined sample size, recording mechanism, etc), the power of the hypothesis test trades off with its Type I error rate, no matter what effect size you target.
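We can see this trade-off with the built-in `power.t.test` function; the effect size and standard deviation below are illustrative values, not taken from the text:

```{r}
# Power of a two-sample t-test for the same (hypothetical) effect at two alpha levels
power.t.test(n = 12, delta = 3, sd = 3.5, sig.level = 0.05)$power
power.t.test(n = 12, delta = 3, sd = 3.5, sig.level = 0.01)$power
# Lowering alpha from 0.05 to 0.01 lowers the power for the same effect size
```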

::: {.callout-orange appearance="simple" icon="false"}
Testing a Hypothesis

Conducting a hypothesis test typically proceeds in four steps. First, we define the null and alternative hypotheses. Next, we construct a test statistic that summarizes the strength of evidence against the null hypothesis. We then compute a $p$-value that quantifies the probability of having obtained a comparable or more extreme value of the test statistic under the null hypothesis. Finally, based on the $p$-value, we decide whether to reject the null hypothesis.

Step 1. Define the Null and Alternative Hypotheses.

The null hypothesis $H_0$ is the default state of belief about the world. The alternative hypothesis $H_a$ represents something different. We can think of rejecting the null hypothesis as making a discovery about our data; but if we fail to reject it, we do not know whether that is because our sample was too small or because the null hypothesis actually holds.

Step 2. Construct the Test Statistic

The way in which we construct our test statistic T depends on the nature of the null hypothesis that we are testing.

Step 3. Compute the $p$-value

The $p$-value is the area under the density function of the null distribution that falls to the right of the $T$ statistic computed in the previous step; it gives us the probability of observing a value as large as or larger than the $T$ statistic under the assumption that the null hypothesis holds. In general, most commonly used test statistics follow a well-known distribution under the null hypothesis, such as a normal distribution, a t-distribution, a $\chi^2$ distribution, or an $F$ distribution, provided that the sample size is sufficiently large and that some other assumptions hold.

The $p$-value allows us to transform our test statistic, which is measured on some arbitrary and uninterpretable scale, into a number between 0 and 1 that can be easily interpreted.

Step 4. Decide whether to reject the null hypothesis.

Once we have computed the $p$-value, it remains for us to decide whether or not to reject the null hypothesis. If the $p$-value is sufficiently small, we reject it, but deciding what counts as sufficiently small is up to us.

:::

### Statistical power

Statistical power is the ability of a test to detect a specified effect. A test with low power is unlikely to rule out a hypothesized value, even if the value is false. That is, it has a high probability of type II error. Increasing the power of an inference technique requires a trade-off of one sort or another. In practice, this usually means using a larger sample. A preliminary power analysis can give decision-makers information about the study size needed to detect an effect of interest.

A very simple example of how to do that: study planners can estimate the sample size needed to achieve a required margin of error in the estimate of a population proportion using this formula: $$
n = \left( \frac{1.96}{2E} \right)^2
$$

where $E$ is the required margin of error. According to this formula, a simple political poll requiring a 3% margin of error would need a sample size of approximately n=1067 respondents.
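We can verify that figure directly:

```{r}
E <- 0.03  # required margin of error of 3%
n <- (1.96 / (2 * E))^2
n  # approximately 1067 respondents
```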

We are going to use our mice dataset to see how we can calculate the power of our t-test with respect to Type II errors. We consider the data in the file to be our entire population. If we look at the difference in average weight between control and treatment, we see that there is in fact a difference of around 9%:

```{r}
dat <- read.csv("data/mice_pheno.csv") 
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  dplyr::select(Bodyweight) %>% unlist

hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  dplyr::select(Bodyweight) %>% unlist

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
print((mu_hf - mu_control)/mu_control * 100) #percent increase
```

Depending on our sample size, this difference will be perceived by our test or not. Let's see an example with sample size = 12 and alpha = 0.05. We will do first a single test and see that we cannot reject the null hypothesis based on the $p$-value:

```{r}
set.seed(1)
N <- 12
hf <- sample(hfPopulation,N)
control <- sample(controlPopulation,N)
t.test(hf,control)$p.value

```

If we run this experiment multiple times, we can get the percentage of times that our test actually manages to reject the null hypothesis:

```{r}
repetitions <- 2000
N <- 12
alpha <- 0.05
reject <- function(N, alpha = 0.05){
  hf <- sample(hfPopulation, N)
  control <- sample(controlPopulation, N)
  pval <- t.test(hf, control)$p.value
  pval < alpha
}

rejections <- replicate(repetitions, reject(N))
mean(rejections)

```

In this case we see that 25% of the time we correctly rejected the null hypothesis; this is the power of our test. To increase this percentage we should increase the sample size. In the next code section we run the same simulation for various values of N:

```{r, fig.align='center'}
Ns <- seq(5, 50, 5)
power <- sapply(Ns, function(N){
  rejections <- replicate(repetitions, reject(N))
  mean(rejections)
})
plot(Ns, power, type = "b")
```

There are several tools on the internet that allow you to calculate the power you need for a particular standard deviation, sample size, or effect size you want to detect.
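In R itself, the built-in `power.t.test` function performs these calculations; the effect size and standard deviation below are illustrative values, not estimates from the mice data:

```{r}
# Sample size per group needed to detect a (hypothetical) difference of 3 units
# with sd = 3.5, at alpha = 0.05 and 80% power
power.t.test(delta = 3, sd = 3.5, sig.level = 0.05, power = 0.8)
```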

When performing a hypothesis test over a large number of data points, the sample size can significantly impact the $p$-value.

**Influence of Sample Size on the p-value**:

*Increased Sensitivity*: A larger sample size increases the power of the test, meaning it's more likely to detect a true effect if one exists. This is because with more data, we get a clearer picture of the population, reducing the standard error.

*Reduced Standard Error*: The standard error of the mean decreases as the sample size increases. This makes the test statistic (such as the $t$-score or $z$-score) larger if there is an effect, which can lead to a smaller $p$-value.

*Smaller p-values*: With a large sample size, even small differences from the null hypothesis can result in statistically significant $p$-values. This is because the test becomes more sensitive to detecting small effects.

*Risk of Overinterpretation*: While small $p$-values indicate statistical significance, they do not necessarily imply practical significance. In large datasets, it's crucial to consider the effect size and practical relevance of the results.

*Example*: Consider a hypothesis test to determine whether a new drug has a different effect than a placebo. If you have a very large sample size, even a tiny difference in outcomes between the drug and placebo groups may produce a statistically significant $p$-value, even if the difference is not clinically meaningful.

*Visual Explanation*: Imagine plotting the $p$-value as a function of sample size. Initially, as the sample size increases, the $p$-value decreases sharply, indicating stronger evidence against the null hypothesis. However, beyond a certain point, the $p$-value becomes very small for even minute differences, highlighting the need to interpret results in context.

*Large Sample Size*: Leads to smaller $p$-values, higher test sensitivity, and potential for detecting trivial differences as significant.

*Interpret Results Carefully*: Always consider effect size and practical significance, not just statistical significance.

```{r, fig.align='center'}

# Set seed for reproducibility
set.seed(123)

# Generate synthetic data
small_sample <- rnorm(30, mean = 0.03, sd = 1)
large_sample <- rnorm(2000, mean = 0.03, sd = 1)

# Function to calculate p-values for increasing sample sizes
calculate_p_values <- function(data, max_size) {
  p_values <- vector("numeric", max_size)
  for (i in 2:max_size) {  # Start from 2 to avoid t-test error
    sample_data <- data[1:i]
    test_result <- t.test(sample_data, mu = 0)
    p_values[i] <- test_result$p.value
  }
  return(p_values)
}

# Calculate p-values for both small and large samples
small_sample_p_values <- calculate_p_values(small_sample, length(small_sample))
large_sample_p_values <- calculate_p_values(large_sample, length(large_sample))

# Combine results into a data frame
p_values_df <- data.frame(
  Sample_Size = c(1:length(small_sample), 1:length(large_sample)),
  P_Value = c(small_sample_p_values, large_sample_p_values),
  Sample_Type = rep(c("Small_Sample", "Large_Sample"), c(length(small_sample), length(large_sample)))
)

# Filter out NA values
p_values_df <- p_values_df %>% filter(!is.na(P_Value))

# Plot the p-values with facets
ggplot(p_values_df, aes(x = Sample_Size, y = P_Value, color = Sample_Type)) +
  geom_line() +
  labs(title = "Influence of Sample Size on P-Value",
       x = "Sample Size",
       y = "P-Value",
       color = "Sample Type") +
  theme_minimal() +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  scale_color_manual(values = c("Small_Sample" = "blue", "Large_Sample" = "green")) +
facet_wrap(~ Sample_Type, scales = "free_x", ncol = 1)+
  ylim(0, 1)


```

In the graphs above we show how even a very tiny deviation (our data's $\mu = 0.03$) from the null hypothesis ($\mu = 0$) shows up as significant once the sample exceeds roughly 1500 data points. The small-sample panel, which runs the test on at most 30 data points, never detects the same difference as significant.

### Student's t test

So far we have been using the t-test without explaining the *Student's t-distribution* that works under the hood. Let's explain it now.

We will explain Student's t-test with an example: the health guideline for lead in drinking water is a concentration of no more than 15 parts per billion (ppb). Five independent samples from a reservoir average 15.6 ppb. Is this sufficient evidence to conclude that the concentration $\mu$ in the reservoir is above the standard of 15 ppb?

Our null hypothesis is that it is not: the concentration is right at the standard, $H_0:\mu=15$; the alternative hypothesis is that the concentration is higher than 15 ppb.

$$
z= \frac{observed - expected}{SE} = \frac{15.6-15}{SE}
$$

The SE of the average is $\frac{\sigma}{\sqrt{n}}$, but $\sigma$ is unknown. By the bootstrap principle we should be able to substitute the standard deviation of the population $\sigma$ with the standard deviation of our sample $s$. However, for sample sizes of 20 or less, the normal curve is not a good enough approximation to the distribution of the $z$-statistic. An appropriate approximation instead is Student's t-distribution with $n-1$ degrees of freedom.
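To make the mechanics concrete, here is the test run on a hypothetical set of five measurements averaging 15.6 ppb (the individual values are invented for illustration; the text only gives the average):

```{r}
# Hypothetical lead measurements (invented for illustration), mean = 15.6 ppb
lead <- c(15.2, 15.5, 15.6, 15.7, 16.0)
mean(lead)
# One-sided t-test against the 15 ppb standard, with n - 1 = 4 degrees of freedom
t.test(lead, mu = 15, alternative = "greater")
```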

```{r, fig.align='center', fig.height=4, fig.width=5}
#| echo = FALSE


# Create a sequence of x values
x_values <- seq(-4, 4, by = 0.01)

# Calculate the density of the t-distribution for different degrees of freedom
df1 <- dt(x_values, df = 1)
df2 <- dt(x_values, df = 2)
df5 <- dt(x_values, df = 5)
df_inf <- dnorm(x_values) # Normal distribution as t-distribution with infinite degrees of freedom

# Create a data frame for plotting
data_to_plot <- data.frame(
    x = c(x_values, x_values, x_values, x_values),
    density = c(df1, df2, df5, df_inf),
    degree_of_freedom = factor(c(rep(1, length(df1)), rep(2, length(df2)), rep(5, length(df5)), rep('∞', length(df_inf))))
)

# Plot using ggplot
ggplot(data_to_plot, aes(x = x, y = density, color = degree_of_freedom)) +
    geom_line() +
    labs(title = "Student's t-Distribution", x = "x", y = "Density") +
    scale_color_discrete(name = "ν")

```

In the graph, the purple line, where the degrees of freedom are high (large sample), is just the normal curve. The other lines show how, with fewer degrees of freedom, the tails of the curve get fatter, representing higher uncertainty. Recall the **sample standard deviation formula** (@eq-sampleStandarDeviation):

$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2}
$$

In this case we also need to adjust our **confidence interval** formula from: $CI=estimate \pm z\  SE$

to $$
CI= estimate \pm t_{n-1}\ SE
$$ {#eq-confidenceIntervalTStudent}
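As a sketch with made-up data, we can compute the t-based interval by hand and check that it matches what `t.test` reports:

```{r}
# Manual 95% t-based confidence interval for a mean (made-up data)
x <- c(12, 15, 14, 10, 13, 17, 11, 14)
n <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # t quantile with n - 1 degrees of freedom
c(mean(x) - t_crit * se, mean(x) + t_crit * se)
t.test(x)$conf.int               # same interval
```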

::: {.callout-orange appearance="simple" icon="false"}
::: centered-text
**z-test vs t-test**
:::

The main differences between a t-test and a $z$-test lie in their assumptions and applications:

**T-Test**

-   **Distribution**: Uses the Student's t-distribution.

-   **Population Variance**: Unknown and estimated from the sample.

-   **Sample Size**: Typically used for small sample sizes (n \< 30).

-   **Degrees of Freedom**: Required for the calculation.

-   **Application**: Used when comparing the means of two groups, especially when the sample size is small and the population variance is unknown.

**Z-Test**

-   **Distribution**: Uses the standard normal distribution (z-distribution).

-   **Population Variance**: Known.

-   **Sample Size**: Typically used for large sample sizes (n \> 30).

-   **Degrees of Freedom**: Not required.

-   **Application**: Used for hypothesis testing of means and proportions when the sample size is large and the population variance is known.

[**Key Differences**]{.underline}

1.  **Distribution**:

    -   T-test: Student's t-distribution.

    -   $z$-test: Standard normal distribution.

2.  **Population Variance**:

    -   T-test: Unknown and estimated from the sample.

    -   $z$-test: Known.

3.  **Sample Size**:

    -   T-test: Small sample sizes (n \< 30).

    -   $z$-test: Large sample sizes (n \> 30).

4.  **Degrees of Freedom**:

    -   T-test: Required.

    -   $z$-test: Not required.

[**Example Scenarios**]{.underline}

-   **T-Test**: Comparing the average test scores of two small classes where the population variance is unknown.

-   **Z-Test**: Testing the average height of a large population where the population variance is known.
:::

The t-test is the most commonly used test in real life, but if we meet all the criteria for a $z$-test we can perform it in R like this:

```{r}
library(BSDA)
# Sample data
sample_data <- c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99, 105, 109, 109, 109, 110, 112, 112, 113, 114, 115)
population_mean <- 100
population_sd <- 15

# Perform a one-sample z-test
z_test_result <- z.test(sample_data, mu = population_mean, sigma.x = population_sd)
print(z_test_result)
```

# Categorical data

In 1912 the Titanic sank and more than 1500 of the 2229 people on board died. Did the chances of survival depend on the ticket class?

|          | First | Second | Third | Crew |
|----------|-------|--------|-------|------|
| Survived | 202   | 118    | 178   | 215  |
| Died     | 123   | 167    | 528   | 698  |

This is an example of categorical data. The data are counts for a fixed number of categories. Here the data are tabulated in a 2x4 table. Such a table is called a **Contingency Table** because it shows the survival counts for each category of ticket class, i.e. contingent on ticket class.

## Confidence Intervals for Proportions.

Confidence intervals for proportions provide a range of plausible values for population proportions, enabling estimation with a desired level of confidence.

Proportional analysis techniques allow for the assessment and comparison of proportions across different categories, providing valuable insights into categorical data.

There are any number of statistical methods for computing confidence intervals for proportions. By far the simplest is:

$$
p = \hat{p} \pm \sqrt{\frac{\hat{p} \cdot (1 - \hat{p})}{n}}
$$ {#eq-confidenceIntervalProportions}

where $p$ is the actual population proportion and $\hat{p}$ is the observed proportion.

As long as the sample size isn't too small, the difference between methods will be minor. If your sample includes at least 5 of each possible outcome, you are fine with whatever default method your software uses to calculate the proportions.

In R we will use `prop.test(x, n)`, where `x` is the number of successes and `n` is the sample size.

Practice:

We are going to calculate the proportion of smokers in our population based on our optical_sample data:

```{r}
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)

as.data.frame(report(optical_sample$IsSmoker))

successNumber<- sum(optical_sample$IsSmoker)
sampleSize <- NROW(optical_sample$IsSmoker)
testResult <- prop.test(successNumber, sampleSize)
testResult

#proportion of success 
proportion<- testResult$estimate %>% print()
conf_int <- testResult$conf.int%>% print()

```

We can use the formula we studied at the beginning of the chapter to manually calculate the confidence interval as well, and see how it differs from what the software used:

```{r}
ci_lower<- proportion - sqrt(proportion*(1-proportion)/sampleSize) 
ci_upper<- proportion + sqrt(proportion*(1-proportion)/sampleSize) 
ci_lower
ci_upper
```

Another way of manually calculating the confidence interval is to use a $z$-score. In the example below we use 95% (`qnorm(0.975)` is the $z$-score corresponding to a 95% confidence level):

```{r}
margin_of_error <- qnorm(0.975) * sqrt(proportion * (1 - proportion) / sampleSize)

# Calculate the confidence interval bounds
proportion - margin_of_error
proportion + margin_of_error
```

For different confidence levels you will need to find the corresponding $z$-score.
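For example, `qnorm` gives the $z$-scores for the usual confidence levels:

```{r}
# z-scores for common two-sided confidence levels
qnorm(0.95)   # 90% confidence level -> ~1.645
qnorm(0.975)  # 95% confidence level -> ~1.960
qnorm(0.995)  # 99% confidence level -> ~2.576
```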

### Non-binary categorical variables

If our categorical variable is not binary, we can still use this same test to calculate the proportion of one single category against the rest.

For example, to calculate the proportion of women in our pm_survey dataset:

```{r}
file2 <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file2)

kable(head(pm_survey))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()
as.data.frame(report(pm_survey$gender))

pm_survey <- pm_survey |> 
  mutate(across(.cols = c(highest_level_of_education_completed, gender), as.factor))

numberOfWomen<- NROW(pm_survey[pm_survey$gender == 'Woman',])
sampleSize <- NROW(pm_survey)
prop.test(numberOfWomen, sampleSize)

```

### Few observations for each outcome

While there are statistical methods for dealing with samples with fewer than 5 of each sort of outcome, you should be conservative about making decisions based on them. That said, two methods might be helpful in such situations:

-   The **Wilson score interval** works well even for observed proportions near zero or one and has other good theoretical properties as well. Unlike the other confidence intervals we've seen, it isn't symmetric.

-   The **rule of three** is a great rough-and-ready way to compute a 95% confidence interval for a proportion when every observation had the same outcome. If you observed failures only, the interval goes from 0 to $3/n$, where $n$ is the sample size; if you observed successes only, it goes from $1 - 3/n$ to 1.
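A quick sketch of the rule of three with a hypothetical sample of $n = 50$ observations:

```{r}
# Rule of three: approximate 95% CI when all n observations share one outcome
n <- 50            # hypothetical sample size
c(0, 3 / n)        # failures only: interval from 0 to 3/n
c(1 - 3 / n, 1)    # successes only: interval from 1 - 3/n to 1
```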

## Significance testing for proportions

We want to know if the proportion of people taking medication in our sample is the same as the proportion of people in the USA taking medication, which we know to be 66% of the population. We can plug that value into the test with the `p =` argument, where `p` is the proportion under our null hypothesis.

```{r}
file <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file)

as.data.frame(report(optical_sample$TakingMedication))

medicated<- sum(optical_sample$TakingMedication == "yes")
sampleSize <- nrow(optical_sample)

testResult <- prop.test(medicated, sampleSize, p = .66)

testResult$p.value
testResult$conf.int
```

In our case we can reject the null hypothesis and say that it is not reasonable to believe that this sample was drawn from a population with a proportion of 66%.

::: exercise-box
Exercises

**Exercise 1.** A media outlet claims that 60% of project managers in the U.S. have a college degree as their highest level of education. Is that claim plausible using our pm_survey data set?

```{r}
file <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file)

pmWithCollegeDegree <- sum(pm_survey$highest_level_of_education_completed == 'College degree')
sampleSize <- nrow(pm_survey)

testResult <- prop.test(pmWithCollegeDegree, sampleSize, p= 0.6)

testResult$p.value
testResult$conf.int
```

In this case we cannot reject the null hypothesis.

**Exercise 2**. Use that data set to construct a 95% confidence interval for the proportion of project managers that are under the age of 35.

```{r}
pm_under35 <- nrow(pm_survey[pm_survey$how_old_are_you %in% c("18-24","25-34"),])

testResult <- prop.test(pm_under35, sampleSize)
testResult$conf.int 
testResult$estimate
```

The confidence interval shows that between 44% and 57% of respondents are under 35.
:::

## Goodness of Fit {#sec-goodness-of-fit}

Consider genetic data where you have two genotype groups (A or B) for cases and controls of a given disease. The statistical question is whether genotype and disease are associated. As in the examples we have studied previously, we have two populations (A and B) and numeric data for each, where disease status can be coded as 0 or 1. So why can't we perform a t-test? Note that the data is either 0 (control) or 1 (case). This data is clearly not normally distributed, so the t-distribution approximation is certainly out of the question. We could use the CLT if the sample size is large enough; otherwise, we can use *association tests*.

Imagine we have 250 individuals, where some of them have a given disease and the rest do not. We observe that 20% of the individuals that are homozygous for the minor allele (group B) have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?

```{r}
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("A",200),rep("B",50)),
                levels=c("A","B"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up

tab<- table(genotype,disease)
tab
```

The typical statistic we use to summarize these results is the odds ratio (OR). We compute the odds of having the disease if you are a "B": 10/40, the odds of having the disease if you are an "A": 20/180, and take the ratio: $(10/40) / (20/180)$

```{r}
(tab[2,2]/tab[2,1]) / (tab[1,2]/tab[1,1])
```

To compute a $p$-value, we don't use the `OR` directly. Instead we assume that there is no association between genotype and disease, and compute what we expect to see in each cell of the table under this null hypothesis. Ignoring the groups, the probability of having the disease according to our data is:

```{r}
p=mean(disease=="cases")
p
```

According to this probability, under our null hypothesis we would expect to see a table like this:

```{r}
expected <- rbind(c(1-p,p)*sum(genotype=="A"),
                  c(1-p,p)*sum(genotype=="B"))
dimnames(expected)<-dimnames(tab)
expected
```

The Chi-square test uses an asymptotic result (similar to the CLT) related to the sums of independent binary outcomes. Using this approximation, we can compute the probability of seeing a deviation from the expected table as big as the one we saw. The $p$-value for this table is:

```{r}
chisq.test(tab)$p.value
```

so we would expect to find a difference in disease ratio this large by chance approximately 8.8% of the time, which is not enough to reject our null hypothesis.

**Large Samples, Small** $p$-values

Reporting only $p$-values is not an appropriate way to report the results of your experiment. Studies with large sample sizes will have impressively small $p$-values. Yet when one looks closely at the results, we realize odds ratios are quite modest: barely bigger than 1.

To demonstrate, we recalculate the $p$-value keeping all the proportions identical but multiplying every count by 10, which reduces the $p$-value substantially:

```{r}
tab<-tab*10
chisq.test(tab)$p.value
```

::: {.callout-orange appearance="simple" icon="false"}
1.  **Fitting Fallacies**: Goodness of fit doesn't guarantee predictive power. A model can fit past data perfectly yet fail miserably on new, unseen data, highlighting the dangers of overfitting.

2.  **The Sample Size Paradox**: Doubling your sample size doesn't necessarily halve the error. In fact, to do so, you'd need to quadruple the sample, given the square root relationship between sample size and margin of error.

3.  **Two-Sample Twists**: When comparing two samples, it's possible for each sample's individual data to appear random, yet their combined data can reveal a distinct pattern or significant difference.
:::

A goodness-of-fit test, also called the chi-squared or Pearson goodness-of-fit test, considers whether a categorical variable follows a hypothesized distribution.

Warning! **A goodness of fit test doesn't give any information about which specific categories might be out of line with the hypothesized distribution.** While you might be able to make an educated guess by looking at the data, this test shouldn't be used to support that kind of intuition.

In 2008 the manufacturer of M&Ms published their last color distribution:

| Blue | Orange | Green | Yellow | Red | Brown |
|------|--------|-------|--------|-----|------|
| 24%  | 20%    | 16%   | 14%    | 13% | 13%  |

We open several packages of M&Ms and count the colors:

| Blue | Orange | Green | Yellow | Red | Brown |
|------|--------|-------|--------|-----|------|
| 85   | 79     | 56    | 64     | 58  | 68   |

Are these counts consistent with the last published percentages? Is there sufficient evidence to claim that the color distribution has changed? This question requires a test of goodness-of-fit for the six categories.

Our null hypothesis is that the color distribution is the same. The idea is to compare the observed counts to the numbers we would expect if $H_0$ is true, so for our sample of 410 M&Ms we would expect:

| Blue | Orange | Green | Yellow | Red  | Brown |
|------|--------|-------|--------|------|------|
| 98.4 | 82     | 65.6  | 57.4   | 53.3 | 53.3 |

We take the difference between the observed and expected values, square it, and standardize it by dividing by the expected count:

$$
\chi^2 =\sum_{categories} \frac{(observed-expected)^2}{expected}
$$

$$
\frac{(85 - 98.4)^2}{98.4} + \frac{(79 - 82)^2}{82} + \cdots + \frac{(68 - 53.3)^2}{53.3} = 8.57
$$

Large values of the chi-square statistic $\chi^2$ are evidence against the null hypothesis. To calculate the $p$-value we use the chi-square distribution. The $p$-value is the right tail of the $\chi^2$ distribution with degrees of freedom = number of categories -1.

```{r, fig.align='center'}
df <- 5
# Create the chi-square distribution curve
curve(dchisq(x, df = df), from = 0, to = 40, 
      main = paste("Chi-Square Distribution (df =", df, ")"),
      ylab = "Density", lwd = 2, col = "steelblue")

# Draw a vertical line at x = 8.57
abline(v = 8.57, col = "red", lwd = 2, lty = 2)

# Highlight the area to the right of x = 8.57
x_vector <- seq(8.57, 40, length.out = 1000)
y_vector <- dchisq(x_vector, df = df)
polygon(c(8.57, x_vector, 40), c(0, y_vector, 0), 
        col = adjustcolor("red", alpha.f = 0.3), border = NA)
```

In our example, this $p$-value happens to be 12.7%, which does not allow us to reject the null hypothesis. In R we can calculate the $p$-value like this:

```{r}
# Set the chi-square value and degrees of freedom
x <- 8.57
df <- 5

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
```
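As a cross-check, the whole test can be run in one call with `chisq.test`, passing the observed counts and the published proportions:

```{r}
observed  <- c(85, 79, 56, 64, 58, 68)        # Blue, Orange, Green, Yellow, Red, Brown
published <- c(.24, .20, .16, .14, .13, .13)  # published color distribution
res <- chisq.test(observed, p = published)
res$statistic  # ~8.57
res$p.value    # ~0.127
```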

::: exercise-box
Exercise:

In the exercise below we want to know if the age distribution in our project managers survey matches the age distribution of the USA population. We get the USA distribution from Wikipedia and make some adjustments so the ranges match those in our survey:

```{r}
file1 <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file1)

us_ages <- c(.117, .176, .168, .158, .166, .215)

table(pm_survey$how_old_are_you)
obs_ages <- as.numeric(table(pm_survey$how_old_are_you))

testResult <- chisq.test(obs_ages, p = us_ages)
testResult %>% report()

testResult$p.value
```

The very small $p$-value indicates that it is extremely unlikely that our pm_survey data was drawn at random from a population with the us_ages distribution.

As mentioned, the goodness-of-fit test does not tell us which categories show the discrepancies. If we want to find out how our sample's age ranges differ from the USA data, we can simply subtract the proportions and see which categories are underrepresented and which are overrepresented:

```{r}
obs_ages / sum(obs_ages)
obs_ages / sum(obs_ages) - us_ages
```

Now we want to see whether, in our optical sample, the customers are evenly distributed among the opticians. If we don't pass a `p` argument to the chi-squared test, it assumes we are asking for an even distribution across all categories:

```{r}
file2 <- here::here("data", "optical_full.xlsx")
optical_full <- read_excel(file2)

counts <- as.numeric(table(optical_full$`Optician Last Name`))

testResult<- chisq.test(counts)
testResult %>% report()

testResult$p.value

```

In this case we cannot reject the null hypothesis, which here is that the count of patients is evenly distributed among all opticians.
:::

::: exercise-box
Exercise

Does the distribution of populations of towns in the United States follow [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law)? Check using the towns data set.

```{r, fig.align='center'}
theme_set(theme_minimal())

file <- here::here("data", "towns.xlsx")
towns <- read_excel(file)

benford <- c(.301, .176, .125, .097, .079, .067, .058, .051, .046)
sum(benford)
#we calculate the frequency of each first digit in the population of towns.
table(towns$first_digit)
digits_freq <- as.numeric(table(towns$first_digit))
digits_freq

testResult<- chisq.test(digits_freq, p = benford)
testResult %>% report()
testResult$p.value


# An illustrative plot
benford_df <- data.frame(distribution = c(rep("Towns", 9), rep("Benford", 9)),
                         first_digit = rep(1:9, 2),
                         frequency = c(digits_freq/sum(digits_freq), benford))

ggplot(benford_df, aes(x = first_digit, 
                       y = frequency,
                       fill = distribution)) + 
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:9) + 
  labs(x = "First digit",
       y = "Relative frequency",
       fill = "Distribution") +
  scale_fill_brewer(palette = "Dark2")

```

Our $p$-value indicates that we cannot reject the null hypothesis, meaning in this case that the distribution of the first digit of town populations across the US is compatible with Benford's law.
:::

# Two-Sample testing

Two-sample testing allows for the comparison of two independent groups or populations to assess if there are statistically significant differences between their means, proportions, or other relevant measures.

*Confidence intervals* for two-sample comparison provide a range of plausible values for the difference in means or proportions, allowing for estimation with a desired level of confidence.

*Significance testing* for two-sample comparison involves evaluating the evidence against the null hypothesis and determining if the observed differences between groups are statistically significant.

In this simple exercise we are going to generate ratings for two products at random. Since we have not changed the probability of any rating, both products should have the same expected rating, so the difference of their mean ratings should approach 0 given enough samples.

Creating many samples at random, calculating their differences, and plotting those differences shows that, although the bell curve is centered around 0 as expected, many individual samples show a non-zero difference:

```{r, fig.align='center'}
theme_set(theme_minimal())

set.seed(27)

rating <- sample(1:5, 27, replace = TRUE)
product <- c(rep("A", 15), rep("B", 12))
AB_testing <- data.frame(product,
                         rating)

AB_testing |> 
  group_by(product) |> 
  dplyr::summarize(avg_rating = mean(rating))

AB_testing |> 
  group_by(product) |> 
  dplyr::summarize(avg_rating = mean(rating)) |> 
  ggplot(aes(x = product, 
             y = avg_rating,
             fill = product)) + 
  geom_col() +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none") +
  labs(x = "Product",
       y = "Average rating")


# repeat the samples 10000 times:
difference <- integer()
for (rep in 1:10000){
  rating <- sample(1:5, 27, replace = TRUE)
  difference[rep] <- mean(rating[1:15]) - mean(rating[16:27]) 
}
qplot(difference, binwidth = .2, xlab = "Difference in ratings")
```

In reality we will only have access to one of the samples, and we have to be able to tell if it's reasonable to draw a conclusion about the population just based on the sample or whether or not we should just attribute these sorts of differences to random chance.

## Significance testing for Two-Samples data

### z-test

The significance test for two samples uses the null hypothesis that there is no difference between the two population means. The $p$-value will be used to decide whether to reject that null hypothesis.

We can use a $z$-test for the difference between the two means:

$$
z = \frac{observed\ difference- expected\ difference}{SE\ of\ difference} = \frac{(\bar{x_2} - \bar{x_1})-0}{SE\ of\ difference}
$$

Our expected difference is 0 because that's our null hypothesis (no difference). An important fact is that if $\bar{x_1}$ and $\bar{x_2}$ are independent, then:

$$ 
SE(\bar{x_2} - \bar{x_1}) = \sqrt{ (SE(\bar{x_1}))^2 +(SE(\bar{x_2}))^2}
$$

::: exercise-box
Exercise: two-sample $z$-test (proportions).

*Last month, the president's approval rating in a sample of 1000 likely voters was 55%. This month, a poll of 1,500 likely voters resulted in a rating of 58%. Is this sufficient evidence to conclude that the rating has changed?*

$\hat{p_1} = 55\%$ and $\hat{p_2} = 58\%$

$$
z = \frac{(\hat{p_2}-\hat{p_1})-0}{SE_{diff}}=\frac{(\hat{p_2}-\hat{p_1})-0}{\sqrt{(SE(\hat{p_1}))^2 +(SE(\hat{p_2}))^2}}
$$

The formula for the standard error of a proportion is

$$ 
SE= \sqrt{\frac{p(1-p)}{n}}
$$

Calculate the standard errors:

For $\hat{p_1}$: $$
SE(\hat{p_1}) = \sqrt{\frac{\hat{p_1}(1 - \hat{p_1})}{n_1}} = \sqrt{\frac{0.55 \times 0.45}{1000}}= \sqrt{\frac{0.2475}{1000}} = \sqrt{0.0002475} \approx 0.0157
$$

For $\hat{p_2}$:

$$
SE(\hat{p_2}) = \sqrt{\frac{\hat{p_2}(1 - \hat{p_2})}{n_2}} = \sqrt{\frac{0.58 \times 0.42}{1500}} = \sqrt{\frac{0.2436}{1500}} = \sqrt{0.0001624} \approx 0.0127
$$

$$
SE_{diff} = \sqrt{(0.0157)^2 + (0.0127)^2} = \sqrt{0.00024649 + 0.00016129} = \sqrt{0.00040778} \approx 0.0202
$$

$$
z = \frac{(0.58 - 0.55) - 0}{0.0202} = \frac{0.03}{0.0202} \approx 1.49
$$

The calculated $z$-value is approximately 1.49. To determine if this is statistically significant, you would compare this $z$-value to the critical value from the standard normal distribution (typically 1.96 for a 95% confidence level). Since 1.49 is less than 1.96, you would fail to reject the null hypothesis at the 95% confidence level, indicating that there is not sufficient evidence to conclude that the president's approval rating has changed significantly.

The $p$-value can be calculated using standard normal distribution tables or software

```{r}
pValue <- pnorm(1.49)
pValue
```

Since this is a two-tailed test (we are checking if the approval rating has changed, not just increased or decreased), we need to consider both tails of the distribution, so we double our values: The $p$-value is calculated as $2 \times (1 - \text{cumulative probability})$. $p = 2 \times (1 - 0.9318) = 2 \times 0.0682 = 0.1364$

The $p$-value for the $z$-value of 1.49 is approximately 0.1364. Since this $p$-value is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This means there is not sufficient evidence to conclude that the president's approval rating has changed significantly.
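The whole calculation can also be scripted; the small difference from the hand computation (0.1364) comes from rounding $z$ to 1.49 above:

```{r}
p1 <- 0.55; n1 <- 1000   # last month
p2 <- 0.58; n2 <- 1500   # this month
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z <- (p2 - p1) / se_diff
p_value <- 2 * (1 - pnorm(z))   # two-tailed p-value
c(z = z, p_value = p_value)
```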
:::

The exercise above shows how to calculate the $p$-value for proportions. If we are working with averages instead of proportions, the calculation is the same; the only thing to consider is that the standard error of each individual sample is calculated using the formula for the standard error of the mean, $SE(\bar{x_1})= \frac{\sigma_1}{\sqrt{n_1}} \approx \frac{s_1}{\sqrt{n_1}}$. If the sample sizes are not large, then the $p$-value needs to be computed from the t-distribution instead.
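As a sketch for the means case (simulated data, not from any of our data sets): with large samples the $z$-based $p$-value is essentially the same as the one `t.test` reports:

```{r}
set.seed(42)
x1 <- rnorm(100, mean = 50, sd = 10)  # first sample
x2 <- rnorm(120, mean = 53, sd = 10)  # second sample
# SE of the difference, using the SE of each mean
se_diff <- sqrt(sd(x1)^2 / length(x1) + sd(x2)^2 / length(x2))
z <- (mean(x2) - mean(x1)) / se_diff
p_value <- 2 * (1 - pnorm(abs(z)))
p_value
t.test(x2, x1)$p.value  # nearly identical for samples this large
```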

### The Welch Two Sample t-test

If we have two sets, one control and one treatment, with means $\mu_t$ and $\mu_c$, the two-sample t-statistic is defined as:

$$
T = \frac{\hat\mu_t-\hat\mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}}
$$ {#eq-two-sampletstatistic}

where

$$
s=\sqrt{\frac{(n_t-1)s^2_t+(n_c-1)s^2_c}{n_t+n_c-2}}
$$

is an estimator of the pooled standard deviation of the two samples. Here, $s^2_t$ and $s^2_c$ are unbiased estimators of the variance of our metric in the treatment and control groups, respectively. A large (absolute) value of $T$ provides evidence against $H_0$.
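A quick sketch with simulated treatment/control data confirms that this pooled formula reproduces the statistic `t.test` returns when `var.equal = TRUE`:

```{r}
set.seed(1)
treat   <- rnorm(20, mean = 10.5, sd = 2)  # treatment group
control <- rnorm(25, mean = 10,   sd = 2)  # control group
nt <- length(treat); nc <- length(control)
# pooled standard deviation
s <- sqrt(((nt - 1) * var(treat) + (nc - 1) * var(control)) / (nt + nc - 2))
T_manual <- (mean(treat) - mean(control)) / (s * sqrt(1 / nt + 1 / nc))
T_manual
t.test(treat, control, var.equal = TRUE)$statistic  # same value
```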

In the example below we use the attrition dataset to see whether the average age of employees differs between two departments:

```{r}
file <- here::here("data", "attrition1.xlsx")
attrition1 <- read_excel(file)

kable(head(attrition1))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()

attrition1 |> 
  group_by(Department) |> 
  dplyr::summarize(avg_age = mean(Age))

testResult <- t.test(Age ~ Department, 
       data = attrition1)

testResult$p.value

report(testResult)

```

## Confidence intervals for Two-Samples data.

We can use the formula for the standard error of the difference to build a confidence interval. The confidence interval for $p_2-p_1$ is $(\hat{p_2}-\hat{p_1}) \pm z \times SE(\hat{p_2}-\hat{p_1})$,

where $z$ is the $z$-score for the confidence level we are interested in.

::: exercise-box
In our example about the voters' approval of the president:

The $z$-score for a confidence level of 95% is 1.959964:

$$
(0.58-0.55) \pm 1.96 \times 0.0202  \approx [-0.0096, 0.0696]
$$

If we solve the same exercise using R, it gives similar but not exactly identical results:

```{r}
successes <- c(0.55 * 1000, 0.58 * 1500)

# Define the sample sizes
sample_sizes <- c(1000, 1500)

# Perform the two-sample z-test for proportions
test_result <- prop.test(successes, sample_sizes)

# Print the result
print(test_result)
```
:::

There are many statistical techniques for describing the difference in a single variable across two populations. The most universal is the **Welch two-sample confidence interval**, a technique for comparing means between two independent groups that takes into account the unequal variances often encountered in real-world data, so it is valid in nearly all circumstances. A few things to bear in mind:

-   It is a multi-sample procedure, not a multi-variable one. Use it when you're asking how much two samples differ in a single variable.

-   Like every other statistical tool in this course, it requires that all the observations be independent of one another (so it should not be used for time-series analysis)

-   Unless the data has extreme outliers or is highly asymmetric, it will give good results when both samples are of size 10 or more. If the samples are of size at least 30, it is fine under all but the most extreme circumstances.

The Welch two-sample confidence interval does not require that the samples are equal in size or that the populations have equal variances, unlike some other procedures.

We are going to use the AB_testing data we generated in the previous section and calculate the confidence interval for the differences observed in one of our samples:

```{r}
set.seed(27)
rating <- sample(1:5, 27, replace = TRUE)
product <- c(rep("A", 15), rep("B", 12))
AB_testing <- data.frame(product, rating)

AB_testing |>
  group_by(product) |>
  dplyr::summarize(avg_rating = mean(rating))

as.data.frame(report(AB_testing))

testResult <- t.test(rating ~ product, data = AB_testing)
testResult %>% report()
testResult$conf.int
```

The confidence interval says that with 95% confidence, the population mean difference is between `r testResult$conf.int[1]` and `r testResult$conf.int[2]`. This is actually saying that the difference in the means could be 0.

If you know, or can reasonably assume, that the variance of the two populations is equal, then you can set `var.equal = TRUE` in `t.test`; in that case, instead of a Welch two-sample t-test, it will run a classic two-sample t-test:

```{r}
testResult <- t.test(rating ~ product,
                     data = AB_testing,
                     var.equal = TRUE)
testResult %>% report()
testResult$conf.int
```

In the example below we use our substance_abuse data set to find out whether there is a difference in the variable DLA_improvement depending on the program the patient was following. We assume the variance is equal in both groups.

We can change the confidence level manually if we want:

```{r}
file <- here::here("data", "substance_abuse.xlsx")
substance_abuse <- read_excel(file)
substance_abuse$DLA_improvement <- substance_abuse$DLA2 - substance_abuse$DLA1
t.test(DLA_improvement ~ Program,
       data = substance_abuse,
       var.equal = TRUE,
       conf.level = .99)
```

::: exercise-box
Exercises

**Exercise 1**: *Generate a 95% confidence interval for the difference in average monthly income between the research and development and sales departments of the company*

```{r}
testResult <- t.test(MonthlyIncome ~Department, 
                     data= attrition1)
testResult$conf.int
report(testResult)
```

The confidence interval does not include 0, which suggests that there is a difference between the population means of the two groups.

**Exercise 2**: *Is it reasonable to claim that the monthly rate is the same between these two departments?*

```{r}
testResult <- t.test(MonthlyRate ~Department, 
                      data= attrition1,
                      mu =0)
testResult$p.value
report(testResult)
```

In light of the results we cannot reject the null hypothesis that they have the same monthly rate.
:::

### Pooled estimate:

Going back to our recent exercise on the voters' approval rate: we concluded that the two surveys did not differ significantly.

Since we could not reject the null hypothesis that the two samples represent the same level of approval for the candidate, we can combine them to find a better estimate of our standard error.

In the first sample we have 0.55 × 1000 = 550 voters who approved; in the second we have 0.58 × 1500 = 870, so in total we have 1420 approvals out of 2500 people surveyed. Our pooled estimate is therefore $\frac{1420}{2500}=56.8\%$, and we can now recalculate the standard error using that value:

$$ 
SE(\hat{p_2}-\hat{p_1}) = \sqrt{\frac{0.568(1-0.568)}{1000}+\frac{0.568(1-0.568)}{1500}}=0.02022 
$$
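The same computation in R:

```{r}
n1 <- 1000; n2 <- 1500
approvals <- 0.55 * n1 + 0.58 * n2   # 550 + 870 = 1420
p_pooled  <- approvals / (n1 + n2)   # 0.568
se_diff   <- sqrt(p_pooled * (1 - p_pooled) / n1 +
                  p_pooled * (1 - p_pooled) / n2)
c(p_pooled = p_pooled, se_diff = se_diff)
```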

### Pooled standard deviation:

If we know (or there is reason to assume) that the standard deviation of the two populations is the same $\sigma_1=\sigma_2$ then we can use the pooled estimate:

$$ 
s^2_{pooled}=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2} 
$$

::: {.callout-orange appearance="simple" icon="false"}
**Two-Sample** $z$-Test vs Two-Sample t-Test vs Welch's Two-Sample t-Test

**Two sample** $z$-test

When to Use:

-   **Known Population Variances**: The population variances are known.

-   **Large Sample Sizes**: Typically used when the sample sizes are large (n \> 30).

-   **Normal Distribution**: Assumes that the data follows a normal distribution.

**Two-sample t-test**

When to Use:

-   **Unknown Population Variances**: The population variances are unknown and assumed to be equal.

-   **Small Sample Sizes**: Typically used when the sample sizes are small (n \< 30).

-   **Normal Distribution**: Assumes that the data follows a normal distribution.

**Welch's Two-Sample T-Test**

When to Use:

-   **Unknown and Unequal Population Variances**: The population variances are unknown and not assumed to be equal.

-   **Small or Unequal Sample Sizes**: Can be used for small or unequal sample sizes.

-   **Normal Distribution**: Assumes that the data follows a normal distribution.

**Summary:**

-   **Population Variances**:

    -   **Z-Test**: Assumes known population variances.

    -   **Welch's T-Test**: Does not assume equal variances and uses sample variances.

-   **Sample Size**:

    -   **Z-Test**: Typically used for large sample sizes.

    -   **Welch's T-Test**: Can be used for small or unequal sample sizes.
:::

## Paired-t test

What do we do when we have two samples that are not independent of each other? In this case we cannot use the classical two-sample $z$-test or two-sample t-test.

We want to answer the question: Do husbands tend to be older than their wives?

| Husband's age | Wife's age | age difference |
|---------------|------------|----------------|
| 43            | 41         | 2              |
| 71            | 70         | 1              |
| 32            | 31         | 1              |
| 68            | 66         | 2              |
| 27            | 26         | 1              |

In a scenario like this, even if the samples were independent, the test would not show a significant difference: the difference within each pair is always small, while the spread of values within each sample (the standard deviations) is large, and what the two-sample test does is compare the difference in means to the fluctuation within each population.

In this case we will use the paired t-test. Our $H_0$ is that the population difference is 0. The formula for this test is:

$$
t=\frac{\bar{d}-0}{SE_{(d)}}
$$

where $\bar{d}$ is the average of the differences,

0 is the expected difference under our null hypothesis, and $SE_{(d)}$ is the standard error of the differences:

$$
SE_{(d)}= \frac{\sigma_d}{\sqrt{n}} = \frac{s_d}{\sqrt{n}}
$$

In the example with our data:

$$
t=\frac{1.4}{\frac{0.55}{\sqrt{5}}}=5.69
$$

and now we have to use a table of the Student t-distribution with 4 degrees of freedom to find the area under the curve to the right of our t-value.

We can use R code to calculate it like this:

```{r}
# Given values
t_value <- 5.69
degreesfreedom <- 4

# Calculate the p-value for a two-tailed test
p_value <- 2 * pt(-abs(t_value), degreesfreedom)
p_value

```

In this case our result means that we can reject the null hypothesis.
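The same test comes directly from `t.test` with `paired = TRUE`, using the five couples from the table (the slightly different $t$ comes from rounding the standard deviation to 0.55 above):

```{r}
husband <- c(43, 71, 32, 68, 27)
wife    <- c(41, 70, 31, 66, 26)
t.test(husband, wife, paired = TRUE)
```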

## The sign test

Imagine that we don't know the age difference; we only know whether each husband is older or not. We can follow the same approach as with a binomial distribution, like the coin toss. In this specific case our null hypothesis $H_0$ is that half of the husbands are older than their wives (no difference). We assign labels to the results, for example 1 if the husband is older than the wife and 0 otherwise.

$$
z=\frac{sum\ of\ 1s -\frac{n}{2}}{SE\ of\ sum} =\frac{sum\ of\ 1s -\frac{n}{2}}{\sqrt{n}\times\sigma_{H_0}}
$$

In our case all 5 husbands are older than their wives, so we have five 1s, and the standard deviation under our null hypothesis is $\frac{1}{2}$ because we expect half the husbands to be older than their wives.

$$
z= \frac{5-\frac{5}{2}}{\sqrt{5}\frac{1}{2}}= 2.24
$$

Now we can find the $p$-value using a table or software:

```{r}
z_score<- 2.24
# Calculate the p-value for a two-tailed test
p_value <- 2 * (1 - pnorm(z_score))
p_value
```

We can see that the result of this test is less significant than the test we did before; this is because the sign test uses less information (only the sign of each difference, not its magnitude).

If we want to solve the problem using software we can do the calculations:

```{r}
# Number of successes (husbands older than wives) 
successes <- 5  
# Total number of pairs 
n <- 5 
# Z-test (Sign test approximation) 
z_score <- (successes - 0.5 * n) / sqrt(0.25 * n) 
z_p_value <- 2 * (1 - pnorm(z_score)) 
z_p_value
```

but if we use the corresponding binomial test instead that would apply for this scenario, we get to quite a different result:

```{r}
# Number of successes (husbands older than wives) 
successes <- 5  
# Total number of pairs 
n <- 5  
# Perform the binomial test 
test_result <- binom.test(successes, n, p = 0.5, alternative = "two.sided")  

print(test_result)
```

::: {.callout-orange appearance="simple" icon="false"}

Both the two-sample paired t-test and the sign test are used to compare paired data, but they are applied in different situations based on the assumptions and characteristics of the data.

**Two-Sample Paired T-Test**

**When to Use**:

-   **Normal Distribution**: The differences between the paired observations should be approximately normally distributed.

-   **Interval or Ratio Data**: The data should be measured on an interval or ratio scale.

-   **Parametric Test**: This test is parametric, meaning it relies on assumptions about the distribution of the data.

**Example**:

-   Comparing the blood pressure of patients before and after a treatment.

-   Measuring the weight of individuals before and after a diet program.

**Sign test**

1.  **Paired Observations**: When you have paired data (e.g., before and after measurements) and you want to test if there is a consistent difference between the pairs. For example, testing if a treatment has an effect by comparing measurements before and after the treatment.

2.  **Median Differences**: When you want to test if the median of differences between pairs is zero. This is useful when the data does not meet the assumptions required for parametric tests like the paired t-test. It does not rely on assumptions about the distribution of the data.

3.  **Non-Normal Data**: When the data does not follow a normal distribution, making parametric tests inappropriate. The sign test does not assume any specific distribution for the data.

4.  **Ordinal Data**: When the data is ordinal (ranked) rather than interval or ratio. The sign test can handle data that can only be compared as greater than, less than, or equal to.

[**Example Scenarios**]{.underline}

-   **Medical Studies**: Comparing the effectiveness of a treatment by measuring patient conditions before and after the treatment.

-   **Quality Control**: Testing if a new manufacturing process consistently produces better results than the old process.

-   **Behavioral Studies**: Comparing responses before and after an intervention. Comparing the number of days patients feel better before and after a new medication.

[**Summary:**]{.underline}

-   **Use a paired t-test** when the differences between paired observations are normally distributed and you have interval or ratio data.

-   **Use a sign test** when the data does not meet the normality assumption, is ordinal, or you prefer a non-parametric approach.
:::

## Wilcoxon Rank Sum Test

We learned how the sample mean and SD are susceptible to outliers. The t-test is based on these measures and is susceptible as well. The Wilcoxon rank-sum test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we generate data for which the null hypothesis is true, but then mistakenly enter two outlying values in one of the samples. The t-test then yields a small $p$-value, while the Wilcoxon test does not:

```{r}
set.seed(779) ##779 picked for illustration purposes
N=25
x<- rnorm(N,0,1)
y<- rnorm(N,0,1)
```

Create outliers:

```{r}
x[1] <- 5
x[2] <- 7
cat("t-test pval:",t.test(x,y)$p.value)
cat("Wilcox test pval:",wilcox.test(x,y)$p.value)
```

The basic idea is to: 1) combine all the data, 2) convert the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.

```{r rank-test-illustration, fig.cap="Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values ",fig.width=10.5,fig.height=5.25}
library(rafalib)
mypar(1,2)

stripchart(list(x,y),vertical=TRUE,ylim=c(-7,7),ylab="Observations",pch=21,bg=1)
abline(h=0)

xrank<-rank(c(x,y))[seq(along=x)]
yrank<-rank(c(x,y))[-seq(along=x)]

stripchart(list(xrank,yrank),vertical=TRUE,ylab="Ranks",pch=21,bg=1,cex=1.25)

ws <- sapply(x,function(z) rank(c(z,y))[1]-1)
text( rep(1.05,length(ws)), xrank, ws, cex=0.8)
W <-sum(ws) 
```

`W` is the sum of the ranks for the first group relative to the second group. We can compute an exact $p$-value for $W$ based on combinatorics. We can also use the CLT since statistical theory tells us that this `W` is approximated by the normal distribution. We can construct a $z$-score as follows:

```{r}
n1<-length(x);n2<-length(y)
Z <- (mean(ws)-n2/2)/ sqrt(n2*(n1+n2+1)/12/n1)
print(Z)
```

Here the `Z` is not large enough to give us a $p$-value less than 0.05. These are part of the calculations performed by the R function `wilcox.test`.

We are not going to get into the mathematical details of these calculations, but the formula for the $z$-score here is

$$
z=\frac{U - \frac{n_2}{2}}{\sqrt{\frac{n_2(n_1+n_2+1)}{12n_1}}}
$$

and to perform this test in R we use `wilcox.test(x,y)`.

## Two-samples of binary data

Two observed proportions for a single binary variable can be compared directly using the **two-proportion** $z$-confidence interval and **two-proportion** $z$-test. These apply when there are at least 5 observations of each type in each of the two groups. The formulas are relatively simple. For example a 95% confidence interval for the difference between proportions is

$$
(p_2-p_1) = (\hat p_2-\hat p_1) \pm 1.96 \sqrt{\frac{\hat p_2(1-\hat p_2)}{n_2}+\frac{\hat p_1(1-\hat p_1)}{n_1}} 
$$

where $p_1$ and $p_2$ are the population proportions, $\hat p_1$ and $\hat p_2$ are the observed sample proportions, and $n_1$ and $n_2$ are the sample sizes. R uses an improved version of this formula (with a continuity correction) when computing the interval.
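To see the two versions side by side, here is the hand formula next to `prop.test` with the continuity correction switched off (made-up counts; with `correct = FALSE` the two should agree almost exactly):

```{r}
x <- c(55, 70); n <- c(100, 100)   # hypothetical success counts and sample sizes
p_hat <- x / n
se <- sqrt(p_hat[1] * (1 - p_hat[1]) / n[1] + p_hat[2] * (1 - p_hat[2]) / n[2])
(p_hat[1] - p_hat[2]) + c(-1, 1) * 1.96 * se  # manual 95% CI
prop.test(x, n, correct = FALSE)$conf.int     # R's version
```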

In R we will use `prop.test(x, n)`, where `x` is the number of successes and `n` is the sample size.

::: exercise-box
Examples:

In the attrition dataset: *are the attrition proportions different for the two departments?*

```{r}
kable(head(attrition1))%>%
  kable_styling(latex_options = "scale_down")%>%
  landscape()

table(attrition1$Department)
t<- table(attrition1$Department,
      attrition1$Attrition)
t
yes_counts<- as.numeric(t[,2])
sampleSize <-as.numeric(table(attrition1$Department))

testResult <- prop.test(yes_counts, sampleSize)
testResult
```

Our $p$-value, `r testResult$p.value`, is very low, which means we can reject the default null hypothesis (that the difference between the two samples is 0). The output also gives us the sample proportions for the two departments: `r testResult$estimate`. The confidence interval gives the difference in the proportions of the two departments (as prop1 − prop2); in this case it being negative means that the second department has a higher attrition rate. The order of the groups is as entered in the vectors, so if *yes_counts* has *Research_Development* first and *Sales* second, prop1 will be for *Research_Development* and prop2 for *Sales*.
:::

## Two-samples of categorical variables {#sec-chiSquare}

Two samples of a single categorical variable can be compared with the $\chi^2$ test for homogeneity (chi-squared). Under the hood, this is just a goodness-of-fit test of the hypothesis that the observed proportions in one of the samples are the same as those in the pooled sample.

### Testing homogeneity

The $\chi^2$ test of homogeneity tests the null hypothesis that the distribution of a categorical variable is the same for several populations. It assumes that the samples are drawn independently within and across populations.

See how we can apply this logic to our Titanic survival data:

|          | First | Second | Third | Crew |
|----------|-------|--------|-------|------|
| Survived | 202   | 118    | 178   | 215  |
| Died     | 123   | 167    | 528   | 698  |

Note that in this case we are not sampling from a population. The data are not a random sample of the people on board; rather, the data represent the whole population. So in this case the chance process resulting in survival or death is not the sampling, but the result of random events occurring when looking for a way out of the ship, like getting into a lifeboat or into the water, being rescued from the water on time, etc. Then the 325 observations of first-class passengers represent 325 independent draws from a probability histogram that gives a certain chance of survival. The 285 observations about second-class passengers are drawn from the probability histogram for second-class passengers, which may be different. The null hypothesis says that the probability of survival is the same for all four probability histograms. Under this hypothesis we can estimate the probability of survival by pooling all the data: $\frac{713}{2229}\approx 32\%$. With this number we can calculate the expected number of surviving passengers for each class:

| Surviving | First | Second | Third | Crew  |
|-----------|-------|--------|-------|-------|
| Observed  | 202   | 118    | 178   | 215   |
| Expected  | 104.0 | 91.2   | 225.8 | 292.1 |

| Died     | First | Second | Third | Crew  |
|----------|-------|--------|-------|-------|
| Expected | 221.0 | 193.8  | 480.1 | 620.8 |
| Observed | 123   | 167    | 528   | 698   |
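The expected counts in these tables can be reproduced in R from the row and column totals, since under the null each expected count is row total × column total / grand total (small differences from the tables above are just rounding):

```r
# Observed counts from the Titanic table above
observed <- matrix(c(202, 118, 178, 215,
                     123, 167, 528, 698),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Survived", "Died"),
                                   c("First", "Second", "Third", "Crew")))

# Expected counts under the null: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# The chi-square statistic built from these counts
sum((observed - expected)^2 / expected)
```

This reproduces the Expected rows (up to rounding) and a statistic of about 192, matching the manual calculation below.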

Now we can compute our $\chi^2$ statistic as we learned, using all the differences between expected and observed values:

$$
\chi^2 =\sum_{categories} \frac{(observed-expected)^2}{expected}
$$

$$
=\frac{(202-104)^2}{104}+\frac{(123-221)^2}{221}+\cdots \approx 192.2
$$

In this case the degrees of freedom are calculated as $(4-1)\times(2-1) = 3$, where 4 is the number of classes (the columns) and 2 the number of outcomes (the two rows: survived and died).

In our case the $p$-value will be extremely small, suggesting we should reject the null hypothesis that all ticket classes had the same probability of survival.

```{r}
x <- 192.2
df <- 3

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value

```

The test in R:

```{r}
# Create the contingency table
titanic_data <- matrix(c(202, 118, 178, 215, 123, 167, 528, 698), 
                       nrow = 2, 
                       byrow = TRUE,
                       dimnames = list(Survival = c("Survived", "Died"),
                                       Class = c("First", "Second", "Third", "Crew")))

# Print the contingency table
print(titanic_data)

# Perform the chi-squared test of homogeneity
chi_squared_test <- chisq.test(titanic_data)

# Print the test results
print(chi_squared_test)

```

::: exercise-box
Exercise:

*Use the substance_abuse data set to decide if the distribution of mental health diagnosis is the same for those with and without at least one psychiatric admission.*

```{r}
as.data.frame(report(substance_abuse$MHDx))
as.data.frame(report(substance_abuse$PsychAdmit))

substance_abuse <- substance_abuse %>% mutate(previousAdmit = ifelse(substance_abuse$PsychAdmit == 0, FALSE, TRUE))

t<- table( substance_abuse$previousAdmit,substance_abuse$MHDx)
t
chisq.test(t)
```

In this case, according to our $p$-value, we can reject the hypothesis that patients with at least one previous psychiatric admission have the same distribution of diagnoses as patients without any admission.
:::

The chi-squared test for homogeneity does not say anything specific about any of these diagnoses; it simply says that the distribution of diagnoses between those two groups (admitted vs not admitted) is not the same.

Is the distribution of SUDx the same for both men and women in our Substance Abuse data?

```{r}
head(substance_abuse)
as.data.frame(report(substance_abuse$SUDx))
```

```{r}
t <- table(substance_abuse$Gender, 
           substance_abuse$SUDx)
head(t)
```

Our null hypothesis is that the distribution between gender is the same.

```{r}
result <- chisq.test(t)
report(result)

```

In this case the $p$-value of the test is over 0.05, which indicates that there is no significant difference by gender for the SUDx variable, so men and women are abusing the different categories of substances in the same proportions. The df in the results are the degrees of freedom: with two gender groups this is $(2-1)\times(k-1)$, i.e. the number of SUDx categories minus one.

In reality this is the same as doing a goodness-of-fit test as we saw before; remember it was `chisq.test(counts, p)`, where `counts` were the counts of the different categories in our sample and `p` the proportions in the population we wanted to test against. Back to our example, what we are measuring here is the proportion for women (or men) against the proportion in the full sample. That will tell us if there is a difference between men and women, so we can also write the test like this, getting similar results:

```{r}
sudxProp <- table(substance_abuse$SUDx)/sum(table(substance_abuse$SUDx))

women <- t[1,]

result<- chisq.test(women, p= sudxProp)
report(result)
```

### Independence Testing of Categorical Variables

We want to answer this question: Is gender (M/F) related to voting preference (liberal/conservative)? Now we have two categorical variables: gender and voting preference. The null hypothesis is that the two variables are independent. The alternative hypothesis is that there is some kind of association.

The tool we are going to use is the chi-square ($\chi^2$) test for independence. This test can be used to test whether two categorical variables are independent or not, that is, whether knowledge about one provides information about the other. The math is exactly the same as for the chi-square test for homogeneity, and hence so is the syntax in most statistical packages, including R, but with a few things to bear in mind:

-   Technically, the null hypothesis of a $\chi^2$ test for independence is that in the larger population, the probability of an observation being in any specific pair of categories is equal to the product of the probabilities of being in each of the individual categories.

-   This is another omnibus test: it says nothing about particular categories.

-   Larger cell counts are better, as usual. Do not do this test if any of the counts are less than 5.

We are going to use our product_rating dataset to see the relationship between age groups and the product the user purchased.

First we can look at the contingency table for those two variables:

```{r}
file <- here::here("data", "product_ratings.xlsx")
product_ratings <- read_excel(file)
t <- table(product_ratings$age,
           product_ratings$product)
t
```

```{r}
chisq.test(t)
```

The low $p$-value from this test indicates that the theory that these two categorical variables are independent is implausible.

::: {.callout-orange appearance="simple" icon="false"}
Chi-Squared is used for testing the independence of categorical variables. It compares observed frequencies in a contingency table to expected frequencies under independence.
:::

::: exercise-box
Exercises

*Referring to the substance_abuse data set:*

-   *Is there any evidence of an association between Substance Use Diagnosis and Mental Health Diagnosis among patients receiving an intervention?*

```{r}
treatment <- filter(substance_abuse, 
                    Program == "Intervention")
t<- table (treatment$SUDx, treatment$MHDx)
chisq.test(t)
```

According to this result we cannot conclude that there is a relationship between these two variables.

-   *Is there any evidence of an association between SUDx and DLA_improvement among patients receiving an intervention?*

```{r, fig.align="center"}
ggplot(treatment, aes(x= SUDx, y = DLA_improvement))+
  geom_boxplot()

testResult <- aov( DLA_improvement ~ SUDx, data = treatment)
summary(testResult)
```

The difference is significant, but not by a large margin: the $p$-value shows that in about 3% of cases we could observe differences this large in a sample from a population where the category means are the same.
:::

If we run a TukeyHSD test on this we can compare each pair individually, and we find that the only pair showing a statistically significant difference is alcohol vs opioids.

```{r}
TukeyHSD(testResult)
```

It is easy to confuse the testing for homogeneity and the testing for independence.

### Comparing the different uses of the chi-square test:

+-----------------+---------------------------------------------------------+-------------------------------------------------------------------------------------+
|                 | sample                                                  | research question                                                                   |
+=================+=========================================================+=====================================================================================+
| goodness of fit | single categorical variable. one sample                 | Are the counts of the different categories matching our expected results            |
+-----------------+---------------------------------------------------------+-------------------------------------------------------------------------------------+
| homogeneity     | single categorical variable measured on several samples | Are the groups homogeneous (have the same distribution of the categorical variable) |
+-----------------+---------------------------------------------------------+-------------------------------------------------------------------------------------+
| independence    | two categorical variables measured on one sample        | Are the two categorical variables independent?                                      |
+-----------------+---------------------------------------------------------+-------------------------------------------------------------------------------------+

Examples:

-   We want to know if there are more births on some weekdays (Monday, Tuesday...) than others. We record the weekday of 300 births. What test should we use to know if there is a difference per day? --\> We would use the chi-square test for goodness of fit.

-   A food delivery start-up decides to advertise its service by placing ads on web pages. They wonder whether the percentage of viewers who click on the ad changes depending on how often the viewers were shown the ad. They randomly select 100 viewers from among those who were shown the ad once, 135 from among those who were shown the ad twice, and 150 from among those who were shown the ad three times. --\> We would use the chi-square test of homogeneity.

-   An airline wants to find out whether there is a connection between the customer's status in its frequent flyer program and the class of ticket that the customer buys. It samples 1,000 ticket records at random and for each ticket notes the status level ('none', 'silver', 'gold') and the ticket class ('economy', 'business','first'). --\> chi-square test of independence

-   A county wants to check whether the racial composition of the teachers in the county corresponds to that of the population in the county. It samples 500 teachers at random and wants to compare that sample with the census numbers about the racial groups in that county. --\> chi-square test for goodness of fit

# Permutation Tests

Suppose we have a situation in which none of the standard mathematical statistical approximations apply. We have computed a summary statistic, such as the difference in mean, but do not have a useful approximation, such as that provided by the CLT. In practice, we do not have access to all values in the population so we can't perform a simulation. Permutation tests can be useful in these scenarios.

We are back to the scenario where we only have 12 measurements for each group.

```{r, fig.align='center'}
dat=read.csv("data/femaleMiceWeights.csv")

control <- filter(dat,Diet=="chow") %>% dplyr::select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% dplyr::select(Bodyweight) %>% unlist
obsdiff <- mean(treatment)-mean(control)
obsdiff
```

In previous sections, we showed parametric approaches that helped determine if the observed difference was significant. Permutation tests take advantage of the fact that if we randomly shuffle the cases and control labels, then the null is true. So we shuffle the cases and control labels and assume that the ensuing distribution approximates the null distribution. Here is how we generate a null distribution by shuffling the data 1,000 times:

```{r, fig.align='center'}
N <- 12
avgdiff <- replicate(1000, {
    all <- sample(c(control,treatment))
    newcontrols <- all[1:N]
    newtreatments <- all[(N+1):(2*N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v=obsdiff, col="red", lwd=2)
```

How many of the null means are bigger than the observed value? That proportion would be the $p$-value for the null. The $p$-value here is calculated directly as the proportion of permuted differences whose absolute value is larger than the observed one (a two-sided comparison). We add a 1 to the numerator and denominator to account for misestimation of the $p$-value.

```{r}
#the proportion of permutations with larger difference
(sum(abs(avgdiff) > abs(obsdiff)) + 1) / (length(avgdiff) + 1)

```

Now let's repeat this experiment for a smaller dataset. We create a smaller dataset by sampling:

```{r, fig.align='center'}
N <- 5
control <- sample(control,N)
treatment <- sample(treatment,N)
obsdiff <- mean(treatment)- mean(control)
obsdiff

avgdiff <- replicate(1000, {
    all <- sample(c(control,treatment))
    newcontrols <- all[1:N]
    newtreatments <- all[(N+1):(2*N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v=obsdiff, col="red", lwd=2)
```

Now the observed difference is not significant using this approach. Keep in mind that there is no theoretical guarantee that the null distribution estimated from permutations approximates the actual null distribution. For example, if there is a real difference between the populations, some of the permutations will be unbalanced and will contain some samples that explain this difference. This implies that the null distribution created with permutations will have larger tails than the actual null distribution. This is why permutations result in conservative $p$-values. In addition, with few samples there are only a few distinct permutations, so the smallest attainable $p$-value may be too large to be useful; for this reason, when we have very few samples, we can't do permutations.

Note also that permutation tests still have assumptions: samples are assumed to be independent and "exchangeable". If there is hidden structure in your data, then permutation tests can result in estimated null distributions that underestimate the size of the tails, because the permutations may destroy the existing structure in the original data.

# Multiple test corrections

The **Bonferroni correction** is a statistical method used to address the problem of multiple comparisons. When you perform multiple hypothesis tests, the chance of making a Type I error (false positive) increases. The Bonferroni correction helps control this by adjusting the significance level.

What we do is, if there are $m$ tests, we multiply the $p$-values by $m$. This correction ensures that P(any of the $m$ tests rejects in error) is still smaller than 5%. The Bonferroni correction is often very restrictive: it guards against having even one false positive among the $m$ tests. As a consequence, the adjusted $p$-values may no longer be significant even when a noticeable effect is present, so it only works well when you don't have a large number of tests.
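A minimal sketch with illustrative $p$-values (R's `p.adjust` implements this as `method = "bonferroni"`):

```r
# Illustrative p-values from m = 6 hypothetical tests
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)
m <- length(p_values)

# Bonferroni: multiply each p-value by m, capping at 1
p_bonf <- pmin(p_values * m, 1)
p_bonf

# Same result with the built-in adjustment
all(p_bonf == p.adjust(p_values, method = "bonferroni"))  # TRUE
```

At the 5% level only the raw $p$-value 0.002 (adjusted to 0.012) survives the correction, illustrating how restrictive it is.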

Alternatively if the number of tests is large, it is better to use the **False Discovery Proportion (FDP)**:

$$
FDP= \frac{number\ of\ false\ discoveries}{total\ number\ of\ discoveries}
$$

where a discovery occurs when a test rejects the null hypothesis. If no hypotheses are rejected, the FDP is defined to be 0.

The **False Discovery Rate (FDR)** is the expected proportion of false discoveries among the discoveries. One common method to control the FDR is the Benjamini-Hochberg procedure. This method adjusts the $p$-values to control the FDR at a desired level $\alpha$. The procedure is as follows:

1.  Sort the $p$-values in ascending order. $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$

2.  Find the largest $k$ such that $p_{(k)}\leq\frac{k}{m}\alpha$

3.  Reject the null hypotheses corresponding to $p_{(1)}, \ldots, p_{(k)}$
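The three steps can be carried out by hand; here is a sketch with illustrative $p$-values at $\alpha = 0.05$ (in practice `p.adjust` does this for you):

```r
# Benjamini-Hochberg by hand, for illustrative p-values, alpha = 0.05
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)
alpha <- 0.05
m <- length(p_values)

p_sorted <- sort(p_values)                     # step 1: sort ascending
thresholds <- (1:m) / m * alpha                # step 2: thresholds k/m * alpha
k <- max(c(0, which(p_sorted <= thresholds)))  # largest k with p_(k) <= k/m * alpha
p_sorted[seq_len(k)]                           # step 3: reject these -> 0.002 and 0.01
```

This agrees with `p.adjust(p_values, method = "BH")`: exactly the two adjusted values at or below 0.05 correspond to the raw $p$-values 0.002 and 0.01.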

In R:

```{r}
# Example p-values
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)

# Apply the Benjamini-Hochberg procedure
p.adjusted <- p.adjust(p_values, method = "BH")

print(p.adjusted)

```

This will give you the adjusted $p$-values, and you can compare them to your desired FDR level to decide which hypotheses to reject.

The **validation set approach**. With this approach you split your data set into two parts, a model-building set and a validation set, before the analysis. You may use data snooping (multiple testing) on the first set to find something interesting, and then test that hypothesis on the validation set.
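A minimal sketch of the split (with simulated data; the 50/50 split fraction is an arbitrary choice):

```r
# Simulated data frame standing in for a real data set
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Split into a model-building set and a validation set before the analysis
build_idx  <- sample(nrow(df), size = floor(0.5 * nrow(df)))
building   <- df[build_idx, ]    # explore / data-snoop here
validation <- df[-build_idx, ]   # then test the chosen hypothesis here, once
c(nrow(building), nrow(validation))  # 50 50
```

The key point is that the validation set is touched only once, for the single hypothesis chosen on the building set, so no multiple-testing correction is needed there.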

# Choosing the right test for your data

Source: [Bevans, R. (2023, June 22). Choosing the Right Statistical Test Types & Examples. Scribbr. Retrieved November 18, 2024](https://www.scribbr.com/statistics/statistical-tests/)

### **Statistical assumptions**

Statistical tests make some common assumptions about the data they are testing:

1.  **Independence of observations** (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).

2.  **Homogeneity of variance**: the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test's effectiveness.

3.  **Normality of data**: the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.

If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a **nonparametric statistical test**, which allows you to make comparisons without any assumptions about the data distribution.
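As a sketch of that decision (simulated skewed data; `shapiro.test` is one common normality check):

```r
# Two skewed samples, simulated for illustration
set.seed(42)
a <- rexp(30)
b <- rexp(30, rate = 0.8)

shapiro.test(a)   # a small p-value here argues against normality
wilcox.test(a, b) # so use the nonparametric test instead of t.test(a, b)
```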

If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).

### **Types of variables**

The types of variables you have usually determine what type of statistical test you can use.

**Quantitative variables** represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:

-   **Continuous** (aka ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).

-   **Discrete** (aka integer variables): represent counts and usually can't be divided into units smaller than one (e.g. 1 tree).

**Categorical variables** represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:

-   **Ordinal**: represent data with an order (e.g. rankings).

-   **Nominal**: represent group names (e.g. brands or species names).

-   **Binary**: represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment, these are the independent and dependent variables). Consult the tables below to see which test best matches your variables.

## **Choosing a parametric test: regression, comparison, or correlation**

Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

### **Regression tests**

Regression tests look for **cause-and-effect relationships**. They can be used to estimate the effect of one or more continuous variables on another variable.

### **Comparison tests**

Comparison tests look for **differences among group means**. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).

+--------------------+------------------------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
|                    | Predictor Variable     | Outcome Variable                       | Example                                                                                                              |
+====================+========================+========================================+======================================================================================================================+
| Paired t-test      | Categorical            | Quantitative                           | What is the effect of two different test prep programs on the average exam scores for students from the same class?\ |
|                    |                        |                                        | (Predictor: test program)                                                                                            |
|                    | 1 predictor            | Groups come from the same population   |                                                                                                                      |
+--------------------+------------------------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| Independent t-test | Categorical            | Quantitative                           | What is the difference in average exam scores for students from two different schools?                               |
|                    |                        |                                        |                                                                                                                      |
|                    | 1 predictor            | Groups come from different populations | (Predictor: school A/B)                                                                                              |
+--------------------+------------------------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| ANOVA              | Categorical            | Quantitative                           | What is the difference in average pain levels among post-surgical patients given three different treatments?         |
|                    |                        |                                        |                                                                                                                      |
|                    | 1 or more predictor    | 1 outcome                              |                                                                                                                      |
+--------------------+------------------------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| MANOVA             | Categorical            | Quantitative                           | What is the effect of flower species on petal length, petal width and stem length?                                   |
|                    |                        |                                        |                                                                                                                      |
|                    | 1 or more predictor    | 2 or more outcomes                     |                                                                                                                      |
+--------------------+------------------------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| Pearson's r        | 2 continuous variables | 2 continuous variables                 | How are latitude and temperature related?                                                                            |
+--------------------+------------------------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------+

## Flowchart for parametric tests

```{r, echo=FALSE}
library(DiagrammeR)

f<- DiagrammeR::grViz("
digraph flowchart {
  graph [layout = dot, rankdir = TB]  # TB stands for Top to Bottom
  
  node [shape = rectangle, style = filled, color = lightblue]
  A [label = 'Predictor variable: categorical or quantitative?', color = lightblue, fontcolor = black, fontsize = 12]
  B [label = 'Categorical', color = lightgrey, fontcolor = black, fontsize = 12]
  C [label = 'Quantitative', color = lightgray, fontcolor = black, fontsize = 12]
  D [label = 'Outcome variable: categorical or quantitative?', color = lightblue, fontcolor = black, fontsize = 12]
  E [label = 'Categorical',color = lightgray, fontcolor = black, fontsize = 12]
  F [label = 'Quantitative', color = lightgray, fontcolor = black, fontsize = 12]
  G [label = 'Choose a non-parametric test', color = lightpink, fontcolor = black, fontsize = 12]
  H [label = 'Do a comparison of means test', color = lightpink, fontcolor = black, fontsize = 12]
  I [label = 'How many groups are being compared?', color = lightblue, fontcolor = black, fontsize = 12]
  J [label = 'Two', color = lightgray, fontcolor = black, fontsize = 12]
  K [label = 'More than two', color = lightgray, fontcolor = black, fontsize = 12]
  L [label = 'How many outcome variables?', color = lightblue, fontcolor = black, fontsize = 12]
  M [label = 'One', color = lightgray, fontcolor = black, fontsize = 12]
  N [label = 'More than one', color = lightgray, fontcolor = black, fontsize = 12]
  O [label = 'T-test', color = lightpink, fontcolor = black, fontsize = 12]
  P [label = 'ANOVA', color = lightpink, fontcolor = black, fontsize = 12]
  Q [label = 'MANOVA', color = lightpink, fontcolor = black, fontsize = 12]
  R [label = 'Logistic regression', color = lightpink, fontcolor = black, fontsize = 12]
  S [label = 'How many predictor variables?', color = lightblue, fontcolor = black, fontsize = 12]
  T [label = 'One', color = lightgray, fontcolor = black, fontsize = 12]
  U [label = 'More than one', color = lightgray, fontcolor = black, fontsize = 12]
  V [label = 'Simple regression', color = lightpink, fontcolor = black, fontsize = 12]
  W [label = 'Multiple regression', color = lightpink, fontcolor = black, fontsize = 12]
  
  A -> B
  A -> C
  B -> D
  C -> S
  D -> E
  D -> F
  E -> G
  F -> H
  H -> I
  I -> J
  I -> K
  J -> O
  K -> L
  L -> M
  L -> N
  M -> P
  N -> Q
  C -> R
  S -> T
  S -> U
  T -> V
  U -> W
}
")
f
```

::: {.callout-orange appearance="simple" icon="false"}
### Parametric Tests

Parametric tests are statistical tests that make certain assumptions about the parameters of the population distribution from which the sample is drawn. These assumptions typically include:

-   **Normality**: The data follows a normal distribution.

-   **Homogeneity of variance**: The variances within each group being compared are similar.

-   **Independence**: The observations are independent of each other.

Because of these assumptions, parametric tests are generally more powerful and can make stronger inferences about the population. Some common parametric tests include:

-   **T-tests**: Used to compare the means of two groups.

-   **ANOVA (Analysis of Variance)**: Used to compare the means of three or more groups.

-   **Regression Analysis**: Used to examine the relationship between one or more predictor variables and an outcome variable.

### Non-Parametric Tests

Non-parametric tests, on the other hand, do not make strict assumptions about the population parameters. They are more flexible and can be used when the assumptions of parametric tests are not met, such as when the data does not follow a normal distribution or when sample sizes are small. Non-parametric tests are also useful for ordinal data or data with outliers.

Some common non-parametric tests include:

-   **Mann-Whitney U Test**: Used to compare the medians of two independent groups.

-   **Wilcoxon Signed-Rank Test**: Used to compare the medians of two related groups.

-   **Kruskal-Wallis Test**: Used to compare the medians of three or more independent groups.

-   **Spearman's Rank Correlation**: Used to assess the relationship between two ordinal variables.

### Key Differences

-   **Assumptions**: Parametric tests assume specific population parameters, while non-parametric tests do not.

-   **Data Type**: Parametric tests are typically used for interval or ratio data, whereas non-parametric tests can be used for ordinal data or data that does not meet parametric assumptions.

-   **Power**: Parametric tests are generally more powerful when their assumptions are met, meaning they are more likely to detect a true effect. Non-parametric tests are more robust and flexible but may be less powerful.

### Example Scenario

Imagine you want to compare the test scores of students from two different schools. If the test scores are normally distributed and the variances are similar, you would use a parametric test like the independent t-test. However, if the test scores are not normally distributed or have outliers, you might use a non-parametric test like the Mann-Whitney U Test.
:::

## Choosing a non parametric test

+---------------------------------+--------------------+----------------------------------------+---------------------+
|                                 | \                  | Outcome variable                       | Use in place of...  |
|                                 | Predictor variable |                                        |                     |
+=================================+:==================:+:======================================:+:===================:+
| Spearman's *r*                  | Quantitative       | Quantitative                           | Pearson's *r*       |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Chi square test of independence | Categorical        | Categorical                            | Pearson's *r*       |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Sign test                       | Categorical        | Quantitative                           | One-sample *t*-test |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Kruskal--Wallis *H*             | Categorical        | Quantitative                           | ANOVA               |
|                                 |                    |                                        |                     |
|                                 | 3 or more groups   |                                        |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| ANOSIM                          | Categorical        | Quantitative                           | MANOVA              |
|                                 |                    |                                        |                     |
|                                 | 3 or more groups   | 2 or more outcome variables            |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Wilcoxon Rank-Sum test          | Categorical        | Quantitative                           | Independent t-test  |
|                                 |                    |                                        |                     |
|                                 | 2 groups           | groups come from different populations |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Wilcoxon Signed-rank test       | Categorical        | Quantitative                           | Paired t-test       |
|                                 |                    |                                        |                     |
|                                 | 2 groups           | groups come from the same population   |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+